Abstract. We present algorithms for checking and enforcing robustness
of concurrent programs against the Total Store Ordering (TSO) memory 
model. A program is robust if all its TSO computations correspond to 
computations under the Sequential Consistency (SC) semantics. 
We provide a complete characterization of non-robustness in terms of 
so-called attacks: a restricted form of (harmful) out-of-program-order 
executions. Then, we show that detecting attacks can be parallelized, 
and can be solved using state reachability queries under SC semantics 
in a suitably instrumented program obtained by a linear size source-to- 
source translation. Importantly, the construction is valid for an arbitrary 
number of addresses and an arbitrary number of parallel threads, and it 
is independent from the data domain and from the size of store buffers 
in the TSO semantics. In particular, when the data domain is finite and 
the number of addresses is fixed, we obtain decidability and complexity 
results for robustness, even for an arbitrary number of threads. 
As a second contribution, we provide an algorithm for computing an op- 
timal set of fences that enforce robustness. We consider two criteria of 
optimality: minimization of program size and maximization of its perfor- 
mance. The algorithms we define are implemented, and we successfully 
applied them to analyzing and correcting several concurrent algorithms. 



1 Introduction 

Sequential consistency (SC) [21] is a natural shared-memory model where the
actions of different threads are interleaved while the program order between 
actions of each thread is preserved. However, for performance reasons, modern 
multiprocessors implement weaker memory models relaxing program order. For 
instance, the common store-to-load relaxation, which allows loads to overtake 
stores, reflects the use of store buffers. It is actually the main feature of the TSO 
(Total Store Ordering) model adopted, e.g., in x86 machines [27].

Nonetheless, most programmers usually assume that memory accesses are 
performed according to SC where all shared-memory accesses are instanta- 
neous and atomic. This assumption is actually safe for data-race-free programs 
[3], but in many situations data-race-freedom does not apply. This is, for
instance, the case of programs implementing synchronization operations,
concurrency libraries, and other performance-critical system services employing
lock-free synchronization. In most cases, programmers design programs that are
robust against relaxations, i.e., for which relaxations do not introduce behaviors
that are not allowed under SC. Memory fences must be included appropriately
in programs in order to prevent non-SC behaviors. Getting such programs right
is a notoriously difficult and error-prone task. Therefore, important issues in this 
context are (1) checking robustness of a program against relaxations of a mem- 
ory model, and (2) identifying a set of program locations where it is necessary 
to insert fences to ensure robustness. 

In this paper we address these two issues in the case of TSO. We consider a 
general setting without fixed bounds on the shared memory size, nor on the size of 
the store buffers in the TSO semantics, nor on the number of threads. This allows 
us to reason about robustness of general algorithms without assuming any fixed 
values for these parameters that depend on the actual machine's implementation. 
Moreover, we tackle these issues for general programs, independently from the 
domain of data they manipulate. 

Robustness against memory models has been addressed first by Burckhardt 
and Musuvathi in [9] (actually, for TSO only), and subsequently by Burnim et
al. in [10]. Alglave and Maranget developed a general framework for reasoning
about robustness against memory models in [4,5] (where the term stability is
used instead of robustness). Roughly, these works are based on characterizing 
robustness in terms of acyclicity of a suitable happens-before relation. In that, 
they follow the spirit of Shasha and Snir [28] who introduced a notion of trace 
that captures the control and data dependencies between events of an SC compu- 
tation, and established that computations that are not SC have a happens-before 
relation that contains a cycle. We adopt here the same notion of robustness, i.e., 
a program is (trace-)robust if each of its TSO computations has the same trace 
as some of its SC computations. 

From an algorithmic point of view, the existing works mentioned above do 
not provide decision procedures for robustness. [9,10] provide testing procedures
based on enumerating TSO runs and checking that they do not produce happens- 
before cycles. Clearly, while these procedures can establish non-robustness, they 
can never prove a program robust. On the other hand, [5] provides a sound 
over-approximate static analysis that can prove robustness, but may also
inaccurately report non-robustness and insert fences unnecessarily. We are
interested here in developing an approach that allows for precise checking of
trace-robustness, and for optimal fence insertion (in a sense defined later).

In our previous work [8], trace-robustness against TSO has been proved to be
decidable and PSPACE-complete, even for unbounded store buffers, in the case 
of a fixed number of threads and assuming a fixed number of shared variables, 
ranging over a finite data domain. The method that shows this decidability 
and complexity result does not provide a practical algorithm: it is based on a 
non-deterministic, bounded enumeration of computations. Moreover, it does not 
carry over to the general setting we consider here. Therefore, in this paper we 
propose a novel approach to checking robustness that is fundamentally different 
from [8]. We provide a general, source-to-source reduction of the trace-robustness
problem against TSO to the state reachability problem under SC semantics. 
In other words, we show that trace-robustness is not more expensive than SC 
state reachability, which is the unavoidable problem to be solved by any precise
decision algorithm for concurrent programs. This is the key contribution of the
paper from which we derive other results, such as decidability results in particular 
cases, as well as an algorithm for efficient fence insertion. 

To establish our reduction, we first provide a complete characterization of 
non-robustness in terms of so-called feasible attacks. An attack is a pair of load 
and store instructions of a thread, called the attacker, whose reordering can lead 
to a non-SC computation. In that case we say the attack is feasible, because 
it has a (TSO) witness computation. The special form of witness computations 
then allows us to detect them by tracking SC computations of an instrumented 
program. Given a potential attack, we show how to check its feasibility by solving 
an SC state reachability query in a linear-size instrumented program. The fact 
that only SC semantics (of the instrumented program) needs to be considered 
for detecting non-SC behaviors (of the original program) is important: it relieves 
us from examining TSO computations, which would oblige us to encode (somehow) the
contents of store buffers (as in, e.g., [9,10]). Interestingly, the detection of feasible
attacks can be parallelized, which speeds up the decision procedure. Overall, 
we provide a general reduction of the TSO robustness problem to a quadratic 
number (in the size of the program) of state reachability queries under the SC 
semantics in linear-size instrumented programs of the same type as the original 
one. Our construction is source-to-source and is valid for (1) an arbitrary number 
of memory addresses/variables, (2) an arbitrary data domain, (3) an arbitrary
number of threads, and (4) unbounded store buffers. 

With this reduction, we can harness all available techniques and tools for 
solving reachability queries (either exactly, or approximately) in various classes 
of concurrent programs, regardless of decidability and complexity issues. It also 
yields decision algorithms for significant classes of programs. Assume we have a 
finite number of memory addresses, and the data domain is finite. Then, for a 
fixed number of threads, a direct consequence of our reduction is that the
robustness problem is decidable and in PSPACE since it is polynomially reducible to
state reachability in finite-state concurrent programs [18]. Therefore, we obtain
in this case a simple robustness checking algorithm which matches the com- 
plexity upper bound proved in [8]. Our construction also provides an effective 
decision algorithm for the up to now open case where the number of threads is 
arbitrarily large. Indeed, state reachability queries in this case can be solved in 
vector addition systems with states (VASS), or equivalently as coverability in 
Petri nets, which is known to be decidable [26] and EXPSPACE-hard [23]. In
both cases (fixed or arbitrary number of threads) the decision algorithms do not 
assume bounded store buffers. 

As a last contribution, we address the issue of enforcing robustness by fence
insertion. Obviously, inserting a fence after each store ensures robustness, but 
it also ruins all performance benefits that a relaxed memory model brings. A 
natural requirement on the set of fences is irreducibility, i.e., minimality wrt. set 
inclusion. Since there may be several irreducible sets enforcing robustness, it is 
natural to ask for a set that optimizes some notion of cost. We assume that we 
have a cost function that defines the cost of inserting a fence at each program
location. For instance, by assuming cost 1 for all locations, we optimize the size
of the fence set. Other cost functions reflect the performance of the resulting 
program. We propose an algorithm which, given a cost function, computes an 
optimal set of fences. The algorithm is based on 0/1-integer linear programming 
and exploits the notion of attacks to guide the selection of fences. 

We implemented the algorithms (using SPIN as a back-end reachability 
checker), and applied them successfully to several examples, including mutual 
exclusion protocols and concurrent data structures. The experiments we have 
carried out show that our approach is quite effective: (1) Many of the attacks 
to be examined can be discarded by simple syntactic checks (e.g., the presence 
of a fence between the store and load instructions), and those that require 
solving reachability queries are handled in a few seconds, (2) the fence insertion
procedure efficiently finds optimal sets of fences; in particular, it can handle
the version of the Non-Blocking Write protocol [17] described in [23] (where
the write is guarded by a Linux x86 spinlock) for which Owens' method based 
on so-called triangular data races (see related work below) inserts unnecessary 
fences. 

Related work: There are only few results on decidability and complexity of 
relaxed memory models. Reachability under TSO has been shown to be decidable
but non-primitive recursive [7] in the case of a finite number of threads and
a finite data domain. In the same case, trace-robustness has been shown to be 
PSPACE-complete in [8] using a combinatorial approach. The method we adopt 
in this paper is conceptually and technically different from the one we took in 
[8]. While we reuse from [8] the fact that it is possible to reason on TSO compu- 
tations where only one thread has reordered its actions, we develop here a new 
approach where the main technical contributions reside in the characterization 
of non-robustness in terms of existence of feasible attacks and in the instrumen- 
tation we provide to reduce trace-robustness to SC state reachability. Besides 
getting decidability and complexity results, this reduction allows us to leverage all
the existing verification methods and tools for checking (SC) state reachabil- 
ity in various classes of programs to tackle the issue of checking and enforcing 
robustness against TSO. 

Alur et al. have shown that checking sequential consistency of a concurrent 
implementation wrt. a specification is undecidable in general [5]. This result
does not contradict our findings: we consider here the special case where the 
implementation is the TSO semantics and the specification is the SC semantics 
of a program. In [14], the problem of deciding whether a given computation is SC
feasible has been proved NP-complete. Robustness is concerned with all TSO 
computations, instead. 

As mentioned above, the problem of checking and enforcing trace-robustness 
against weak memory models has been addressed in [9,10,5], but none of these
works provide (sound and complete) decision procedures. Owens proposes in [23] 
a notion of robustness that is stronger than trace-robustness, based on detecting 
triangular data races. This allows for sound trace-robustness checking but, as
Owens shows in his paper, in some cases leads to unnecessary fences which can
be harmful for performance. Moreover, the notion of triangular data races defined 
in [24] comes without an algorithm for checking it¹. Complexity considerations
(using the techniques in [5]) show that detecting triangular data races requires 
solving state reachability queries under SC. Therefore, as we show here, checking 
trace-robustness is not more expensive than detecting triangular data races. 

State-based robustness (which preserves the reachable states) is a weaker 
robustness criterion, but does not preserve path properties in contrast to trace- 
robustness, and is of significantly higher complexity (non-primitive recursive as 
it can be deduced from [7], instead of PSpace). It has been addressed in a 
precise manner in [5] where a symbolic decision procedure together with a fence 
insertion algorithm are provided. The same issue is addressed in [19,20] using
over-approximate reachability analysis based on abstractions of the store buffers. 

Finally, let us mention work that considers the dual approach: starting from a 
robust program, remove unnecessary fences [29]. This work is aimed at compiler
optimisations and does not provide a decision procedure for robustness. It also
cannot find an optimal set of fences that enforces trace-robustness.

2 Parallel Programs 

Syntax We consider parallel programs with shared memory as defined by the 
grammar in Figure 1. Programs have a name and consist of a finite number of
threads. Each thread has an identifier and a list of local registers it operates on. 
The thread's source code is given as a finite sequence of labelled instructions. 
More than one instruction can be marked by the same label; this allows us 
to mimic expressive constructs like conditional branching and iteration with a 
lightweight syntax. The instruction set includes loads from memory to a local 
register, stores to memory, memory fences to control TSO store buffers, local 
computations, and assertions. Figure 2 provides a sample program.

We assume a program comes with two sets: a data domain DOM and a 
function domain FUN. The data domain should contain value zero: $0 \in \mathsf{DOM}$.
Moreover, we assume that all values from DOM can be used as addresses. Each 
memory location accessed by loads and stores is identified by such an address, 
and memory locations identified by different addresses do not overlap. The set 
FUN contains functions that are defined on the data domain and can be used in 
the program. Note that we do not make any finiteness assumptions. 

TSO Semantics Fix a program $\mathcal{P}$ with threads $\mathsf{THRD} := \{t_1, \ldots, t_n\}$. Let each
thread $t_i$ have the initial label $l_{0,i}$ and declare registers $\bar{r}_i$. We define the set of
variables as the union of addresses and registers: $\mathsf{VAR} := \mathsf{DOM} \cup \bigcup_{i \in [1,n]} \bar{r}_i$. We
denote the set of all instruction labels that occur in threads by LAB.

The TSO semantics is operational, in terms of states and labelled transitions 
between them. On the x86 TSO architecture, each processor effectively has a 
local FIFO buffer that keeps stores for later execution [25,27,9,10]. Therefore,

1 Citation from [24]: "... formal reasoning directly on traces can be tedious, so a
program logic or sound static analyzer specialized to proving triangular-race freedom
might make the application of TRF more convenient."



⟨prog⟩  ::= program ⟨pid⟩ ⟨thrd⟩*
⟨thrd⟩  ::= thread ⟨tid⟩
            regs ⟨reg⟩*
            init ⟨label⟩
            begin ⟨linst⟩* end
⟨linst⟩ ::= ⟨label⟩: ⟨inst⟩; goto ⟨label⟩;
⟨inst⟩  ::= ⟨reg⟩ ← mem[⟨expr⟩]
          | mem[⟨expr⟩] ← ⟨expr⟩
          | mfence
          | ⟨reg⟩ ← ⟨expr⟩
          | assert ⟨expr⟩
⟨expr⟩  ::= ⟨fun⟩(⟨reg⟩*)

Fig. 1: Syntax of parallel programs. 



program Dekker

thread t1 regs r1 init l0 begin
  l0: mem[x] ← 1; goto l1;
  l1: r1 ← mem[y]; goto l2;
end

thread t2 regs r2 init l0' begin
  l0': mem[y] ← 1; goto l1';
  l1': r2 ← mem[x]; goto l2';
end

Fig. 2: Simplified version of Dekker's mutex algorithm. Under SC, it is
impossible that r1 = r2 = 0 when both threads reach l2 and l2'.



a state is a triple $s = (pc, val, buf)$ where the program counter $pc: \mathsf{THRD} \to \mathsf{LAB}$
holds, for each thread, the label of the instruction(s) to be executed next. The
valuation $val: \mathsf{VAR} \to \mathsf{DOM}$ gives the values of registers and memory locations.
The third component $buf: \mathsf{THRD} \to (\mathsf{DOM} \times \mathsf{DOM})^*$ is the (per thread) buffer
content: a sequence of address-value pairs $a \leftarrow v$.

In the initial state $s_0 := (pc_0, val_0, buf_0)$ the program counter is set to the
initial labels, $pc_0(t_i) = l_{0,i}$ for all $t_i \in \mathsf{THRD}$, registers and addresses hold value
zero, $val_0(x) = 0$ for all $x \in \mathsf{VAR}$, and all buffers are empty, $buf_0(t) := \varepsilon$ for all
$t \in \mathsf{THRD}$. Here, $\varepsilon$ denotes the empty sequence.

Instructions yield transitions between states that are labelled by actions from 
$\mathsf{ACT} := \mathsf{THRD} \times (\{isu, loc\} \cup (\{ld, st\} \times \mathsf{DOM} \times \mathsf{DOM}))$. An action consists of the
thread id and the actual arguments of an executed instruction. We use loc to 
abstract assignments and asserts that are local to the thread. The issue action isu 
indicates that a store was executed on the processor. The store action (t, st, a, v) 
gives the moment when the store becomes visible in memory. 

The TSO transition relation $\to_{TSO}$ is the smallest relation between TSO
states that is defined by the rules in Table 1. The rules repeat, up to notation and
support for locked instructions, Figure 1 from [25]. The first two rules implement
loads from the buffer and from the memory respectively. By the third rule, 
store instructions enqueue write operations to the buffer. The fourth rule non- 
deterministically dequeues and executes them on memory. The fifth rule defines 
that memory fences can only be executed when the buffer is empty. The last two 
rules refer to local assignments and assertions. We omitted locked instructions 
to keep things simple. Their introduction is straightforward and does not affect 
the results. Indeed, our implementation supports them [1].
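
The rules described above can be tried out on small programs. The following is a minimal sketch, not the authors' implementation: it assumes a hypothetical encoding in which each thread is a straight-line list of instructions with constant stores, and the names `DEKKER`, `FENCED`, `successors`, and `final_registers` are introduced here for illustration only. It explores all interleavings with or without store buffers and collects the register valuations of terminated runs.

```python
# Hypothetical encoding of the program in Figure 2. Each thread is a list of
# instructions executed top to bottom (the goto targets are implicit).
DEKKER = {
    "t1": [("st", "x", 1), ("ld", "r1", "y")],
    "t2": [("st", "y", 1), ("ld", "r2", "x")],
}
# Same program with a fence between the store and the load of each thread.
FENCED = {t: [ins[0], ("mfence",), ins[1]] for t, ins in DEKKER.items()}


def put(pairs, key, value):
    """Functional update of a mapping stored as a sorted tuple of pairs."""
    d = dict(pairs)
    d[key] = value
    return tuple(sorted(d.items()))


def swap(tup, i, x):
    """Return a copy of tup with position i replaced by x."""
    return tup[:i] + (x,) + tup[i + 1:]


def successors(prog, state, buffered):
    """Yield the successor states under TSO (buffered) or SC (not buffered)."""
    pcs, mem, regs, bufs = state
    for i, t in enumerate(sorted(prog)):
        buf = bufs[i]
        if buf:  # dequeue the oldest buffered store and execute it on memory
            a, v = buf[0]
            yield pcs, put(mem, a, v), regs, swap(bufs, i, buf[1:])
        if pcs[i] == len(prog[t]):
            continue
        ins, npcs = prog[t][pcs[i]], swap(pcs, i, pcs[i] + 1)
        if ins[0] == "st" and buffered:  # enqueue the store
            yield npcs, mem, regs, swap(bufs, i, buf + ((ins[1], ins[2]),))
        elif ins[0] == "st":  # SC: the store hits memory immediately
            yield npcs, put(mem, ins[1], ins[2]), regs, bufs
        elif ins[0] == "ld":  # newest buffered store to the address, else memory
            hits = [v for a, v in buf if a == ins[2]]
            v = hits[-1] if hits else dict(mem).get(ins[2], 0)
            yield npcs, mem, put(regs, ins[1], v), bufs
        elif ins[0] == "mfence" and not buf:  # fences need an empty buffer
            yield npcs, mem, regs, bufs


def final_registers(prog, buffered=True):
    """Register valuations of terminated runs whose buffers are all empty."""
    n = len(prog)
    init = ((0,) * n, (), (), ((),) * n)
    seen, todo, finals = {init}, [init], set()
    while todo:
        state = todo.pop()
        pcs, _, regs, bufs = state
        done = all(pc == len(prog[t]) for pc, t in zip(pcs, sorted(prog)))
        if done and not any(bufs):
            finals.add(regs)
        for nxt in successors(prog, state, buffered):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return finals
```

Under these assumptions, the valuation with both registers equal to zero shows up for `DEKKER` only when `buffered=True`, while `FENCED` yields the same valuations in both modes. The exhaustive search only terminates for finite-state programs, which is exactly the restriction the reduction in this paper avoids.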

The set of TSO computations contains all sequences of actions that lead from 
the initial TSO state to a state where all buffers are empty: 



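$$C_{TSO}(\mathcal{P}) := \{\tau \in \mathsf{ACT}^* \mid s_0 \xrightarrow{\tau}_{TSO} s \text{ for some TSO state } s = (pc, val, buf) \text{ with } buf(t) = \varepsilon \text{ for all } t \in \mathsf{THRD}\}.$$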



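$$\frac{\langle instr \rangle = r \leftarrow mem[f_a(\bar{r}_a)], \quad a = f_a(val(\bar{r}_a)), \quad buf(t){\downarrow}(a \leftarrow *) = \beta \cdot (a \leftarrow v)}{(pc, val, buf) \xrightarrow{(t, ld, a, v)}_{TSO} (pc', val[r := v], buf)}$$

$$\frac{\langle instr \rangle = r \leftarrow mem[f_a(\bar{r}_a)], \quad a = f_a(val(\bar{r}_a)), \quad buf(t){\downarrow}(a \leftarrow *) = \varepsilon, \quad v = val(a)}{(pc, val, buf) \xrightarrow{(t, ld, a, v)}_{TSO} (pc', val[r := v], buf)}$$

$$\frac{\langle instr \rangle = mem[f_a(\bar{r}_a)] \leftarrow f_v(\bar{r}_v), \quad a = f_a(val(\bar{r}_a)), \quad v = f_v(val(\bar{r}_v))}{(pc, val, buf) \xrightarrow{(t, isu)}_{TSO} (pc', val, buf[t := buf(t) \cdot (a \leftarrow v)])}$$

$$\frac{buf(t) = (a \leftarrow v) \cdot \beta}{(pc, val, buf) \xrightarrow{(t, st, a, v)}_{TSO} (pc, val[a := v], buf[t := \beta])}$$

$$\frac{\langle instr \rangle = mfence, \quad buf(t) = \varepsilon}{(pc, val, buf) \xrightarrow{(t, loc)}_{TSO} (pc', val, buf)}$$

$$\frac{\langle instr \rangle = r \leftarrow f(\bar{r})}{(pc, val, buf) \xrightarrow{(t, loc)}_{TSO} (pc', val[r := f(val(\bar{r}))], buf)}$$

$$\frac{\langle instr \rangle = assert\ f(\bar{r}), \quad f(val(\bar{r})) \neq 0}{(pc, val, buf) \xrightarrow{(t, loc)}_{TSO} (pc', val, buf)}$$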



Table 1: TSO transition rules, assuming $pc(t) = l$, an instruction ⟨instr⟩ at label
$l$ with destination $l'$, and $pc' := pc[t := l']$. We use $\downarrow$ to denote projection and $*$
for any value, i.e., $buf(t){\downarrow}(a \leftarrow *)$ is a list of address-value pairs in the buffer of
thread $t$ having the address $a$.

The requirement of empty buffers is not important for our results but rather 
a modelling choice. Figure 3 presents a TSO computation of Dekker's program
where the store of the first thread is delayed past the load. 



$\tau = (t_1, isu) \cdot (t_1, ld, y, 0) \cdot (t_2, isu) \cdot (t_2, st, y, 1) \cdot (t_2, ld, x, 0) \cdot (t_1, st, x, 1)$

Fig. 3: A TSO computation of Dekker's algorithm. Actions drawn in red belong 
to the first thread, actions in blue belong to the second thread. The arc connects 
the issue action with the corresponding delayed store action of the first thread. 

SC Semantics Under SC [21], stores are not buffered and hence states are pairs
(pc, val). The rules for SC transitions are appropriately simplified TSO rules. To 
avoid case distinctions between TSO and SC in the definition of traces, a store 
instruction generates two actions: an issue followed by the store. Memory fences 
have no effect under SC. We denote the set of all SC computations of $\mathcal{P}$ by

$$C_{SC}(\mathcal{P}) := \{\sigma \in \mathsf{ACT}^* \mid s_0 \xrightarrow{\sigma}_{SC} s \text{ for some SC state } s\}.$$



3 TSO Robustness 



Robustness ensures that the behaviour of a program does not change when it 
is run on TSO hardware as compared to SC. We study trace-based robustness 
as in [28,9,10,5,8]. Traces capture the essence of a computation: the control and
data dependencies among actions. More formally, consider some computation
$\alpha \in C_{SC}(\mathcal{P}) \cup C_{TSO}(\mathcal{P})$. The trace $Tr(\alpha)$ is a graph where the nodes are labelled
by the actions in $\alpha$ (stores and issue yield one node). The arcs are defined by the
following relations. We have a per thread $t \in \mathsf{THRD}$ total order $\to^{t}_{po}$ that gives
the order in which the actions of $t$ were issued. Similarly, we have a per address
$a \in \mathsf{DOM}$ total order $\to^{a}_{st}$ that gives the ordering of all stores to $a$. We call the
unions $\to_{po} := \bigcup_{t \in \mathsf{THRD}} \to^{t}_{po}$ and $\to_{st} := \bigcup_{a \in \mathsf{DOM}} \to^{a}_{st}$ the program order and the
store order of the trace. Finally, there is a source relation $\to_{src}$ that determines
the store from which a load receives its value. By $Tr_{mm}(\mathcal{P}) := Tr(C_{mm}(\mathcal{P}))$ with
$mm \in \{SC, TSO\}$ we denote the set of all SC/TSO traces of program $\mathcal{P}$. The
TSO robustness problem checks whether the sets coincide.

Given: A parallel program $\mathcal{P}$.

Problem: Does $Tr_{TSO}(\mathcal{P}) = Tr_{SC}(\mathcal{P})$ hold?

Since inclusion $Tr_{SC}(\mathcal{P}) \subseteq Tr_{TSO}(\mathcal{P})$ always holds, we only have to check the
reverse inclusion. We call a computation $\tau \in C_{TSO}(\mathcal{P})$ violating if its trace is
not among the SC traces of the program, i.e., $Tr(\tau) \notin Tr_{SC}(\mathcal{P})$. Violating TSO
computations employ cyclic accesses to addresses that SC is unable to serialize
[28]. The cyclic accesses are made visible using a conflict relation from loads
to stores. Intuitively, $ld \to_{cf} st$ means that $st$ overwrites a value that $ld$ reads.
The union of all four relations is commonly called happens-before relation of the
trace, $\to_{hb} := \to_{po} \cup \to_{st} \cup \to_{src} \cup \to_{cf}$.

Lemma 1 ([28]). Consider TSO trace $Tr(\tau) \in Tr_{TSO}(\mathcal{P})$. Then $Tr(\tau) \in Tr_{SC}(\mathcal{P})$
iff $\to_{hb}$ is acyclic.

Consider the computation in Figure 3. The load from thread $t_1$ conflicts with
the store from $t_2$ because the load reads the initial value of $y$ while the store
overwrites it. The situation with the load from $t_2$ and the store from $t_1$ is
symmetric. Together with the program order, the conflict relations produce a cycle:

$$(t_1, st, x, 1) \to_{po} (t_1, ld, y, 0) \to_{cf} (t_2, st, y, 1) \to_{po} (t_2, ld, x, 0) \to_{cf} (t_1, st, x, 1)$$

Indeed, there is no SC computation with this trace, as predicted by Lemma 1.
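
The acyclicity criterion of Lemma 1 can be checked mechanically on a single computation. The sketch below is an illustration under stated assumptions rather than the procedure of the paper: actions are encoded as Python tuples, a node is identified by its thread and program-order position, the conflict edge is drawn only to the immediate store-order successor of a load's source, and the function names `hb_edges` and `has_cycle` are hypothetical. It assumes that every issued store is eventually flushed, as required for TSO computations.

```python
from collections import defaultdict


def hb_edges(comp):
    """Happens-before edges of a computation; a node is (thread, po index)."""
    fifo = defaultdict(list)          # thread -> addresses of its stores, in order
    for t, kind, *args in comp:
        if kind == "st":
            fifo[t].append(args[0])
    po = defaultdict(list)            # thread -> nodes in program order
    issued = defaultdict(int)         # thread -> number of issues seen so far
    buffered = defaultdict(list)      # thread -> (node, addr) still in the buffer
    in_mem, st_order, src, addr = {}, defaultdict(list), {}, {}
    for t, kind, *args in comp:
        if kind == "st":              # the oldest buffered store reaches memory
            node, a = buffered[t].pop(0)
            st_order[a].append(node)
            in_mem[a] = node
            continue
        node = (t, len(po[t]))        # issue and store share one node
        po[t].append(node)
        if kind == "isu":             # the k-th issue belongs to the k-th store
            a = fifo[t][issued[t]]
            issued[t] += 1
            buffered[t].append((node, a))
        elif kind == "ld":            # source: own buffer first, then memory
            a = args[0]
            own = [n for n, b in buffered[t] if b == a]
            src[node] = own[-1] if own else in_mem.get(a)
            addr[node] = a
    edges = set()
    for nodes in list(po.values()) + list(st_order.values()):
        edges.update(zip(nodes, nodes[1:]))         # program order, store order
    for ld, st in src.items():
        order = st_order[addr[ld]]
        if st is not None:
            edges.add((st, ld))                     # source relation
        nxt = order.index(st) + 1 if st is not None else 0
        if nxt < len(order):
            edges.add((ld, order[nxt]))             # conflict relation
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph given by edges."""
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    state = {}                        # 1 = on the DFS stack, 2 = finished

    def visit(u):
        state[u] = 1
        for v in succ[u]:
            if state.get(v) == 1 or (v not in state and visit(v)):
                return True
        state[u] = 2
        return False

    return any(u not in state and visit(u) for u in list(succ))


# The TSO computation of Figure 3 and one SC computation of the same program.
TAU = [("t1", "isu"), ("t1", "ld", "y", 0), ("t2", "isu"),
       ("t2", "st", "y", 1), ("t2", "ld", "x", 0), ("t1", "st", "x", 1)]
SIGMA = [("t1", "isu"), ("t1", "st", "x", 1), ("t1", "ld", "y", 0),
         ("t2", "isu"), ("t2", "st", "y", 1), ("t2", "ld", "x", 1)]
```

With this encoding, `has_cycle(hb_edges(TAU))` holds while `has_cycle(hb_edges(SIGMA))` does not. Such a check only inspects one computation at a time, which is why it can refute robustness but cannot establish it.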

Lemma 1 does not provide a method for finding cyclic traces. We have recently
shown that TSO robustness is decidable, in fact, PSPACE-complete [8].
The algorithm underlying this result, however, is based on enumeration and not 
meant to be implemented. The main contribution of the present work is a novel 
and practical approach to checking robustness. 



Fig. 4: TSO witness for the attack $(t_A, \text{stinst}, \text{ldinst})$. It satisfies the following
constraints. (W1) Only the attacker delays stores. (W2) Store $st_A$ is an instance
of stinst. It is the first store of the attacker that is delayed. Load $ld_A$ is an instance
of ldinst. It is the last action of the attacker that is overstepped by $st_A$. So $\tau_2$
contains loads, assignments, asserts, and issues, but no fences and stores of the
attacker. It may contain arbitrary helper actions. (W3) We require $ld_A \to^{+}_{hb} act$
for every action $act$ in $ld_A \cdot \tau_3 \cdot st_A$. An issue + store of a helper is counted as one
action $act$. (W4) Sequence $\tau_4$ only consists of stores of the attacker that were
issued before $ld_A$ and that have been delayed. (W5) All these stores $st$ satisfy
$addr(st) \neq addr(ld_A)$, i.e., $ld_A$ has not read its value early.

The only concept we keep from our earlier work is that of minimal violations. A
minimal violation is a violating computation that uses a minimal total number 
of delays. Interestingly, for minimal violations the following holds. 

Lemma 2 (Locality [8], Appendix B). In a minimal violation, only a single
thread delays stores. 

Consider the computation in Figure 3. It relies on a single delay in thread $t_1$ and,
indeed, is a minimal violation. As predicted by the lemma, the second thread 
writes to its buffer and immediately flushes it. 

4 Attacks on TSO Robustness 

Our approach to checking TSO robustness combines two insights. We first 
rephrase robustness in terms of a simpler problem: the absence of feasible attacks. 
We then devise an algorithm that checks attacks for feasibility. Interestingly, SC 
reachability techniques are sufficient for this purpose. Together, this yields a 
sound and complete reduction of TSO robustness to SC reachability. 

The notion of attacks is inspired by the shape of minimal violations. We show 
that if a program is not robust, then there are violations of the form shown in
Figure 4: one thread, the attacker, delays a store action $st_A$ past a later load
action $ld_A$ in order to break robustness. The remaining threads become helpers
and provide a happens-before path from $ld_A$ to $st_A$. This yields a happens-before
cycle and shows non-robustness.

Thread, store instruction stinst of $st_A$, and load instruction ldinst of $ld_A$ are
syntactic objects. The idea of our approach is to fix these three parameters, 
the attack, prior to the analysis. The algorithm then tries to find a witness 
computation that proves the attack feasible. 

Definition 1. An attack $A = (t_A, \text{stinst}, \text{ldinst})$ consists of a thread $t_A \in \mathsf{THRD}$
called attacker, a store instruction stinst and a load instruction ldinst. A TSO
witness for $A$ is a computation of the form in Figure 4, i.e., it satisfies (W1)
to (W5). If a TSO witness exists, the attack is called feasible.

In Dekker's algorithm, there is an attack $A = (t_A, \text{stinst}, \text{ldinst})$ with $t_A = t_1$,
stinst the store at $l_0$, and ldinst the load at $l_1$. A TSO witness of this attack is
the computation $\tau$ from Figure 3. With reference to Figure 4, we have $\tau_1 = \varepsilon$,
$isu_{st_A} = (t_1, isu)$, $\tau_2 = \varepsilon$, $ld_A = (t_1, ld, y, 0)$, $\tau_3 = (t_2, isu) \cdot (t_2, st, y, 1) \cdot (t_2, ld, x, 0)$,
$st_A = (t_1, st, x, 1)$, $\tau_4 = \varepsilon$. The program also contains a symmetric attack $A'$ with
$t_2$ as the attacker.

Although TSO witnesses are quite restrictive computations, robustness can 
be reduced to verifying that no attack has a TSO witness. 

Theorem 1 (Complete Characterization of Robustness with Attacks). 

Program $\mathcal{P}$ is robust iff no attack is feasible, i.e., no attack admits a TSO witness.

Proof. The existence of a TSO witness implies non-robustness of the program.
Indeed, a TSO witness comes with a happens-before cycle $st_A \to^{+}_{po} ld_A \to^{+}_{hb} st_A$.
We argue that also the reverse holds: if a program is not robust, there is a feasible
attack. Assume $\mathcal{P}$ is not robust. We construct a TSO witness computation.
Among the violating computations, we select $\tau \in C_{TSO}(\mathcal{P})$ where the number of
delays is minimal. The computation need not be unique. By Lemma 2, in $\tau$ only
one thread $t_A$ uses its buffer and (W1) holds. We elaborate on the shape of $\tau$.

Initially, the attacker executes under SC so that stores immediately follow
their issues. This computation is embedded into $\tau_1$ in Figure 4. Eventually, the
attacker starts delaying stores. Let $st_A$ be the first store that is delayed. It gets
reordered past several loads, the last of which being $ld_A$. This shows (W2).

The helper actions in $\tau_3$ are depicted in blue in Figure 4. To see that we can
assume (W3), first note that $ld_A \to^{+}_{hb} st_A$ holds. If there was no such path, $st_A$
could be placed before $ld_A$ without changing the trace. This would save a delay,
in contradiction to minimality of $\tau$. Assume $\tau_3 = \tau_3' \cdot act \cdot \tau_3''$ where $act$ is the first
action so that $ld_A \not\to^{+}_{hb} act$. Then $act$ is independent from all actions in $ld_A \cdot \tau_3'$.
We find a computation with the same trace where $act$ is placed before $ld_A$.

With cycle $st_A \to^{+}_{po} ld_A \to^{+}_{hb} st_A$, computation $\tau_4$ only needs to contain the
stores of the attacker that have been delayed past $ld_A$. Since these stores are
non-blocking, the helpers can stop with the last action in $\tau_3$. We can moreover
assume $ld_A$ to be the program order last action of the attacker. (W4) holds.

We now argue that $ld_A$ has not read its value early from any of the delayed
stores, (W5). Towards a contradiction, assume $ld_A$ obtained its value from $st$ in
$\tau_4 = \tau_{41} \cdot st \cdot \tau_{42}$. There is a computation $\tau'$ where we avoid the early read: it
replaces $\tau_4$ by $\tau_{41} \cdot st \cdot ld_A \cdot \tau_{42}$. The traces of $\tau$ and $\tau'$ coincide, but $\tau'$ saves the
delay of $st$ past $ld_A$. A contradiction to minimality.

It is readily checked that $\tau$ is a TSO witness for the attack $(t_A, \text{stinst}, \text{ldinst})$
where stinst and ldinst are the instructions that $st_A$ and $ld_A$ are derived from. □

Since the number of attacks is only quadratic in the size of the program, we 
can just enumerate them and check whether one admits a TSO witness. To 
check whether a witness exists, we employ the instrumentation described in the 
following section. 
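
The enumeration of attacks can be sketched in a few lines. The example below assumes a hypothetical encoding with one instruction per label (the syntax of Section 2 allows several), and the parameter `feasible` is a placeholder for the feasibility check, for instance an SC reachability query on the instrumented program; neither name comes from the paper.

```python
from itertools import product

# Hypothetical encoding of Figure 2: thread -> {label: instruction}.
DEKKER = {
    "t1": {"l0": ("st", "x", 1), "l1": ("ld", "r1", "y")},
    "t2": {"l0'": ("st", "y", 1), "l1'": ("ld", "r2", "x")},
}


def attacks(prog):
    """Yield every candidate attack (thread, store label, load label)."""
    for t, insts in prog.items():
        stores = [lab for lab, ins in insts.items() if ins[0] == "st"]
        loads = [lab for lab, ins in insts.items() if ins[0] == "ld"]
        # Every store/load pair of one thread: quadratic in the thread size.
        for stinst, ldinst in product(stores, loads):
            yield t, stinst, ldinst


def is_robust(prog, feasible):
    """Robust iff no attack is feasible; `feasible` is a caller-supplied check."""
    return not any(feasible(prog, attack) for attack in attacks(prog))
```

For the Dekker encoding this yields exactly two candidates, one per thread. Since the candidates are independent of each other, the calls to `feasible` could also be distributed over several workers.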

5 Instrumentation 

Consider program $\mathcal{P}$ with attack $A = (t_A, \text{stinst}, \text{ldinst})$. TSO witnesses for $A$ only
make limited use of buffers, to an extent that allows us to characterize them
by SC computations in a program $\mathcal{P}_A$ that is instrumented for attack $A$. By
instrumentation we mean that $\mathcal{P}_A$ replaces every thread by a modified version.
Capturing TSO witnesses with a program that runs under SC is difficult for 
two reasons. First, TSO has unbounded store buffers which can delay stores 
arbitrarily long. Second, the happens-before dependence that the helpers create 
may involve an arbitrary number of actions. Our instrumentation copes with 
both problems using the following tricks. 

To handle store buffering, we instrument the attacker thread (Section 5.1).
Essentially, we emulate store buffering under SC using auxiliary addresses. To
explain the idea, consider the TSO witness in Figure 4. When the instrumented
attacker executes the delayed stores $st_A \cdot \tau_4$ under SC, they occur right behind
their issue actions. To mimic store buffering, these stores now access auxiliary
addresses that the other threads do not load. As a result, the stores remain
invisible to the helpers. This is as intended: the delayed stores $st_A \cdot \tau_4$ in Figure 4
are also never accessed by helper threads. But how many auxiliary addresses do 
we need to faithfully simulate buffers? It is sufficient to have a single auxiliary 
address per address in the program. The reason is that a load always reads the 
most recent store to its address that is held in the buffer. 
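
The claim that one auxiliary address per address suffices can be illustrated by comparing the two lookups directly. This is a minimal sketch under the assumption that all stores in question stay delayed, as $st_A \cdot \tau_4$ do in a TSO witness; the function names and the dictionary-based memory are hypothetical.

```python
def tso_load(mem, buf, a):
    """Load under TSO: newest buffered store to a, otherwise the memory value."""
    for b, v in reversed(buf):
        if b == a:
            return v
    return mem.get(a, 0)


def aux_load(mem, a):
    """Attacker load in the emulation: auxiliary cell (a, 'd') if it was written."""
    return mem.get((a, "d"), mem.get(a, 0))


def agree(delayed, addrs):
    """Check after each delayed store that both views answer loads identically."""
    mem_tso, buf, mem_sc = {}, [], {}
    for a, v in delayed:
        buf.append((a, v))        # TSO: the store stays in the FIFO buffer
        mem_sc[(a, "d")] = v      # SC emulation: overwrite the auxiliary cell
        if any(tso_load(mem_tso, buf, x) != aux_load(mem_sc, x) for x in addrs):
            return False
    return True
```

For example, `agree([("x", 1), ("y", 2), ("x", 3)], ["x", "y", "z"])` holds: overwriting the single cell `("x", "d")` loses the older store to `x`, but no load of the attacker could have observed it anyway.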

To build up a happens-before path from $ld_A$ to $st_A$, we instrument the helper
threads (Section 5.2). The question is how to decide whether a new action $act$ is in
happens-before relation with an earlier action $act'$ so that $ld_A \to^{*}_{hb} act' \to_{hb} act$.
What is the information we need about the earlier actions in order to append $act$?
It is sufficient to know two facts. Has the thread already contributed an action
$act'$? This information ensures $act' \to^{*}_{po} act$, and can be kept in the control flow
of the thread. Moreover, we keep track of whether the path contains a load or
store access to the address $addr(act)$. If there was a load access $act' = ld$, we
can add a store $act = st$ and get $ld \to_{hb} st$. If there was a store, we are free to
add a load or a store. Hence, we need one auxiliary address per address in the
program for this access information: no access, load access, store access.

Consider the TSO witness for Dekker given in Figure 3. Instead of buffering
(t1, st, x, 1), the instrumentation immediately executes the store after its issue
action. But instead of address x, the action accesses the auxiliary address (x, d)
that the other threads do not load. To indicate that this store is invisible to the
helper threads, we depict it in gray. So, the SC computation of the instrumented
program roughly looks like this:

(t1, isu) · (t1, st, (x, d), 1) · (t1, ld, y, 0) (1) (t2, isu) · (t2, st, y, 1) (2) (t2, ld, x, 0).

At moment (1), we know that there has been a load access to address y. At
moment (2), address y has even seen a store. At the end of the computation,
address y has seen a store and address x has seen a load.

The store of t2 can be appended since it is in happens-before relation with the
attacker's load. The following load can be added as t2 has contributed the
previous store. The search terminates here since the helper's load accesses address
x that was used by the store from the attack.



5.1 Instrumentation of the Attacker 



The instrumentation emulates the buffering of stores in a TSO witness
(Figure 4). Starting from st_A, the stores are replaced by stores to auxiliary
addresses (a, d) that are only visible to the attacker. As long as a has not been
written, (a, d) holds the initial value 0. Once the attacker stores v into a, we
set mem[(a, d)] = (v, d). In this way, (a, d) always holds the most recent store.
A load r ← mem[a] of the attacker reads a value v from the buffer whenever
mem[(a, d)] = (v, d); otherwise mem[(a, d)] = 0 and the load obtains the value
v = mem[a] from memory. We turn to the translation.

Let t_A declare registers r*, have initial location l_0, and define instructions
(linst)* that contain stinst and ldinst from the attack. The instrumentation is

    [[t_A]] := thread t_A regs r* init l_0
               begin (linst)* [[stinst]]_A1 [[ldinst]]_A1 [[(linst)*]]_A2 end.

It introduces a copy of the source code [[(linst)*]]_A2 where the stores are replaced
by accesses to auxiliary addresses. To move to the code copy, the attacker uses
an instrumented version [[stinst]]_A1 of stinst.



[[l1: mem[e1] ← e2; goto l2;]]_A1 :=
    l1: mem[(e1, d)] ← (e2, d); goto l̃x;                    (1)
    l̃x: mem[a_stA] ← e1; goto l̃2;

[[l1: r ← mem[e]; goto l2;]]_A1 :=
    l̃1: assert mem[(e, d)] = 0; goto l̃x1;                   (2)
    l̃x1: mem[hb] ← true; goto l̃x2;
    l̃x2: mem[(e, hb)] ← lda; goto l̃x3;

[[l1: mem[e1] ← e2; goto l2;]]_A2 :=
    l̃1: mem[(e1, d)] ← (e2, d); goto l̃2;                    (3)

[[l1: r ← mem[e]; goto l2;]]_A2 :=
    l̃1: assert mem[(e, d)] = 0; goto l̃x1;                   (4)
    l̃x1: r ← mem[e]; goto l̃2;
    l̃1: assert mem[(e, d)] ≠ 0; goto l̃x2;
    l̃x2: (r, d) ← mem[(e, d)]; goto l̃2;

[[l1: local; goto l2;]]_A2 :=
    l̃1: local; goto l̃2;                                     (5)

[[l1: mfence; goto l2;]]_A2 :=                               (6)

Fig. 5: Instrumentation of the attacker


The translation of instructions is defined in Figure 5. We make a few remarks.
The instrumentation of stinst = l1: mem[e1] ← e2; goto l2; keeps the
address used in the store in a fresh address a_stA. For the sake of readability,
in Equation (4) we use memory accesses in instructions other than load and
store. Equation (6) deletes fences, as they forbid to delay st_A over ld_A. Let
ldinst = l1: r ← mem[e]; goto l2; be the load used in the attack. Equation (2)
checks that the load has not read its value early and sets an auxiliary
happens-before address (e, hb) to access level load, lda. We postpone the definition of



thread t1 regs r1 init l0 begin
    /* Original code */
    l0: mem[x] ← 1; goto l1;
    l1: r1 ← mem[y]; goto l2;

    /* Instrumented stinst */
    l0: mem[(x, d)] ← (1, d); goto l̃x;
    l̃x: mem[a_stA] ← x; goto l̃1;

    /* Instrumented copy of the store */
    l̃0: mem[(x, d)] ← (1, d); goto l̃1;

    /* Instrumented copy of the load */
    l̃1: assert mem[(y, d)] = 0; goto l̃x4;
    l̃x4: r1 ← mem[y]; goto l̃2;
    l̃1: assert mem[(y, d)] ≠ 0; goto l̃x5;
    l̃x5: (r1, d) ← mem[(y, d)]; goto l̃2;

    /* Instrumented ldinst */
    l̃1: assert mem[(y, d)] = 0; goto l̃x1;
    l̃x1: mem[hb] ← true; goto l̃x2;
    l̃x2: mem[(y, hb)] ← lda; goto l̃x3;
end

Fig. 6: Attacker instrumentation of thread t1 in Dekker from Figure 2. The
attack's store is the store at label l0, the load is the load at label l1.



access levels until the translation of helpers. It also sets the hb flag for helpers
to indicate that they cannot execute actions not contributing to the happens-before
path. Figure 6 illustrates the instrumentation on our running example.

5.2 Instrumentation of Helpers 

In TSO witnesses, by (W3), all helper actions after ld_A are in happens-before
relation with ld_A. To ensure this, we make use of Lemma 3. The proof from left
to right is by definition of happens-before. For the reverse direction, note that
happens-before is stable under insertion. Consider $st \to_{src} ld$. A happens-before
relation remains valid in any computation that places actions between st and ld.



Lemma 3. Consider $\tau = \tau_1 \cdot act_1 \cdot \tau_2 \in C_{SC}(V)$ where for all
$act_2$ in $\tau_2$ we have $act_1 \to^{+}_{hb} act_2$. Computation $\tau \cdot act$
satisfies $act_1 \to^{+}_{hb} act$ iff

(i) there is an action $act_2$ in $act_1 \cdot \tau_2$ with thread($act_2$) = thread(act) or

(ii) act is a load whose address is stored in $act_1 \cdot \tau_2$ or

(iii) act is a store (with issue) whose address is loaded or stored in $act_1 \cdot \tau_2$.
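The three conditions of the lemma translate directly into a check on the suffix act_1 · τ_2. The Python sketch below is illustrative: the `Action` record and the function name `extends_hb_suffix` are hypothetical, and actions carry only the fields the lemma inspects (thread, kind, address).

```python
from typing import NamedTuple


class Action(NamedTuple):
    """Hypothetical action record with the fields Lemma 3 refers to."""
    thread: str
    kind: str       # "ld", "st", or "local"
    addr: str = ""


def extends_hb_suffix(suffix, act):
    """Check conditions (i)-(iii) for appending `act` after `suffix`.

    `suffix` is the list act_1 · tau_2 whose actions are all in
    happens-before relation with act_1.
    """
    same_thread = any(a.thread == act.thread for a in suffix)            # (i)
    stored = any(a.kind == "st" and a.addr == act.addr for a in suffix)
    loaded = any(a.kind == "ld" and a.addr == act.addr for a in suffix)
    if same_thread:
        return True
    if act.kind == "ld":
        return stored                                                    # (ii)
    if act.kind == "st":
        return stored or loaded                                          # (iii)
    return False


# Dekker: the attacker's load of y starts the path.
path = [Action("t1", "ld", "y")]
store_ok = extends_hb_suffix(path, Action("t2", "st", "y"))   # condition (iii)
path.append(Action("t2", "st", "y"))
load_ok = extends_hb_suffix(path, Action("t2", "ld", "x"))    # condition (i)
unrelated = extends_hb_suffix([Action("t1", "ld", "y")], Action("t3", "ld", "y"))
```

Two loads of the same address by different threads are not related, which is why the last call is rejected.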

The lemma suggests the following instrumentation. For every helper t, we track
whether it has executed an action that depends on ld_A. The idea is to use the
control flow. Upon detection of this first action, the thread moves to a copy of
its code. All actions from this copy stay in happens-before relation with ld_A.

It remains to decide whether an action act allows a thread to move to the
code copy. According to Lemma 3, this depends on the earlier accesses to the
address a = addr(act). We introduce auxiliary happens-before addresses (a, hb)
that provide this access information. The addresses (a, hb) range over the domain
{0, lda, sta} of access types. It is sufficient to keep track of the maximal access
type wrt. the ordering 0 (no access) < lda (load access) < sta (store access).



[[l1: instr; goto l2;]]_H0 :=
    l1: assert mem[hb] = 0; goto lx;                         (7)
    lx: instr; goto l2;

[[l1: r ← mem[e]; goto l2;]]_H1 :=
    l1: assert mem[(e, hb)] = sta; goto l̃x;                  (8)
    l̃x: r ← mem[e]; goto l̃2;

[[l1: mem[e1] ← e2; goto l2;]]_H1 :=
    l1: assert mem[(e1, hb)] ≥ lda; goto l̃x1;                (9)
    l̃x1: mem[e1] ← e2; goto l̃x2;
    l̃x2: mem[(e1, hb)] ← sta; goto l̃2;

[[l1: local/mfence; goto l2;]]_H2 :=
    l̃1: local/mfence; goto l̃2;                              (10)

[[l1: mem[e1] ← e2; goto l2;]]_H2 :=
    l̃1: mem[e1] ← e2; goto l̃x;                              (11)
    l̃x: mem[(e1, hb)] ← sta; goto l̃2;

[[l1: r ← mem[e]; goto l2;]]_H2 :=
    l̃1: f ← e; goto l̃x1;                                    (12)
    l̃x1: r ← mem[f]; goto l̃x2;
    l̃x2: mem[(f, hb)] ← max{lda, mem[(f, hb)]}; goto l̃2;

[[l]]_H3 :=
    l̃: f ← mem[a_stA]; goto l̃x1;                            (13)
    l̃x1: f ← mem[(f, hb)]; goto l̃x2;
    l̃x2: assert f ≠ 0; goto l̃x3;
    l̃x3: mem[suc] ← true; goto l̃x4;

Fig. 7: Instrumentation of helpers.



For the definition, consider a helper thread t that declares r*, has initial label
l_0, and defines instructions (linst)*. The instrumented thread is

    [[t]] := thread t regs f, r* init l_0
             begin [[(linst)*]]_H0 [[(ldstinst)*]]_H1 [[(linst)*]]_H2 [[(l)*]]_H3 end.

Here, (ldstinst)* is the subsequence of all load and store instructions. Their
instrumentation [[(ldstinst)*]]_H1 is used to move to the code copy [[(linst)*]]_H2.
Moreover, (l)* are all labels used by the thread. The additional instructions
[[(l)*]]_H3 raise a success flag when a TSO witness has been found. [[(linst)*]]_H0
forces helpers to either enter the code copy or stop when the hb flag is raised.

The translation of instructions is given in Figure 7. We make some remarks.
Transitions to the code copy check the auxiliary addresses for whether the current
action is in happens-before relation with ld_A. Loads in Equation (8) check for an
earlier store access to their address, Lemma 3(ii). Stores in Equation (9) require
that the address has seen at least a load, Lemma 3(iii). They set the access level
to sta. Loads and stores in the code copy maintain the auxiliary addresses to
contain the maximal access types, Equations (12) and (11). Note the auxiliary
register f that ensures we do not overwrite the address. At every label of the code
copy we add a check, Equation (13), whether the address used in the attack's
store has been accessed in the code copy. If so, a success flag is raised.
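The bookkeeping performed by the helper instrumentation can be mimicked by a small monitor. The Python sketch below is a simplification under assumptions: the class `HelperMonitor`, its method `step`, and the integer encoding of access levels are hypothetical, and the monitor abstracts away registers, labels, and data values. It only tracks the access levels (a, hb), which helpers have entered their code copy, and the success flag.

```python
NONE, LDA, STA = 0, 1, 2  # assumed encoding of the access levels 0 < lda < sta


class HelperMonitor:
    """Tracks access levels after the attacker executed its instrumented load."""

    def __init__(self, attack_store_addr, attack_load_addr):
        self.level = {attack_load_addr: LDA}  # set by the instrumented ldinst
        self.in_copy = set()                  # helpers already in their code copy
        self.a_st = attack_store_addr
        self.success = False

    def step(self, thread, kind, addr=None):
        """Try to execute a helper action; return False if it is blocked."""
        current = self.level.get(addr, NONE)
        if thread not in self.in_copy:
            if kind == "ld" and current != STA:   # entry check of Equation (8)
                return False
            if kind == "st" and current < LDA:    # entry check of Equation (9)
                return False
            if kind not in ("ld", "st"):          # only loads/stores enter the copy
                return False
            self.in_copy.add(thread)
        if kind == "st":
            self.level[addr] = STA                # Equations (9) and (11)
        elif kind == "ld":
            self.level[addr] = max(LDA, current)  # Equation (12)
        if self.level.get(self.a_st, NONE) != NONE:
            self.success = True                   # check of Equation (13)
        return True


# Dekker: the attack stores to x and loads y; helper t2 stores y, then loads x.
monitor = HelperMonitor(attack_store_addr="x", attack_load_addr="y")
first = monitor.step("t2", "st", "y")
second = monitor.step("t2", "ld", "x")

# A helper load of x alone cannot start the happens-before path.
blocked = HelperMonitor("x", "y").step("t2", "ld", "x")
```

In the Dekker run the success flag is raised after the helper's load of x, matching the informal walk-through at the beginning of this section.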



5.3 Soundness and Completeness 

The flag indicates that the SC computation corresponds to a TSO witness, and
we call (pc, val) with val(suc) = true a goal configuration. The instrumentation
thus reduces feasibility of attack A to SC reachability of a goal configuration in
program V_A. The instrumentation is sound and complete. If a goal configuration
is reachable, we can reconstruct a TSO witness for the attack. In turn, every
TSO witness ensures the goal configuration is reachable.

Theorem 2 (Soundness and Completeness). Attack A = (t_A, stinst, ldinst)
is feasible in program V iff program V_A reaches a goal configuration under SC.

In combination with Theorem 1, we can check robustness by inspecting all V_A.

Theorem 3 (From Robustness to SC Reachability). Program V is robust
iff no instrumentation V_A reaches a goal configuration under SC.

The instrumentation we provide is linear in size. Then, it follows from Theorem 3
that checking robustness for programs over finite data domains is in PSPACE.
The problem is actually PSPACE-complete due to the lower bound in [8].
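Theorem 3 yields a simple top-level procedure: issue one independent reachability query per attack and report robustness if none of them succeeds. The Python sketch below illustrates this shape only. The function `is_robust` and the callable `goal_reachable` are hypothetical; the latter stands for "instrument the program for this attack and ask an SC model checker whether a goal configuration is reachable" and is stubbed in the example.

```python
from concurrent.futures import ThreadPoolExecutor


def is_robust(attacks, goal_reachable, workers=4):
    """Return True iff no attack's instrumented program reaches a goal configuration.

    `goal_reachable(attack)` is an assumed oracle for SC reachability in the
    program instrumented for `attack`. The queries are independent, so they
    are submitted to a thread pool.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return not any(pool.map(goal_reachable, attacks))


# Stub oracle for illustration: pretend that only the attack by t1 is feasible.
example_attacks = [("t1", "l0", "l1"), ("t2", "l0'", "l1'")]
robust_with_feasible_attack = is_robust(example_attacks, lambda a: a[0] == "t1")
robust_without_attacks = is_robust(example_attacks, lambda a: False)
```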

6 Robustness for Parameterized Programs 

We extend the study of robustness to parameterized programs. A parameterized
program represents an infinite family of instance programs that replicate the
threads multiple times. Syntactically, parameterized programs coincide with the
parallel programs we introduced in Section 2: they have a name and declare a
finite set of threads t_1, ..., t_k. The difference is in the semantics. A parameterized
program represents a family of programs: for every vector I = (n_1, ..., n_k) ∈ N^k,
a program instance V(I) declares n_i copies of thread t_i.

In the parameterized setting, the robustness problem asks whether all in- 
stances of a given program are robust: 

Given: A parameterized program V.

Problem: Does Tr_TSO(V(I)) = Tr_SC(V(I)) hold for all instances V(I) of V?

The problem is interesting because libraries usually cannot make assumptions on 
the number of threads that use their functions. They have to guarantee proper 
functioning for any number. 

We reduce robustness for parameterized programs to a parameterized version
of reachability, based on the following insight. A parameterized program is not
robust if and only if there is an instance V(I) that is not robust. With Theorem 1,
instance V(I) is not robust if and only if there is an attack A that is feasible.
With the instrumentation from Section 5 and Theorem 3, this feasibility can be
checked as reachability of a goal configuration in V(I)_A.

Algorithmically, it is impossible to instrument all (infinitely many) instance
programs. Instead, the idea is to instrument directly the parameterized program
V towards the attack A. Using the constructions from Section 5, we modify every
thread and again obtain program V_A, which is now parameterized.



Actually, for the attacker we have to be slightly more careful. In an instance 
program, only one copy of the thread should act as attacker, the remaining copies 
have to behave like helpers. Therefore, the thread must be instrumented not only 
as an attacker, but also as a helper. To ensure that only one copy of the attacker 
delays stores, we add an additional flag variable. Before starting an attack, the 
thread checks this variable. If it contains the initial value, the thread sets the 
flag and starts delaying stores. If it has a different value, the thread continues to 
run sequentially. This check requires an atomic test-and-set operation which can 
be implemented on x86 by the lock cmpxchg instruction. Support for locked 
instructions is immediate to add to our programming model. 

Modulo these two changes, the instances V_A(I) coincide with the
instrumentations V(I)_A. Together with the argumentation in the last two
paragraphs, this justifies the following theorem.

Theorem 4. A parameterized program V is not robust iff there is an attack A
so that an instance V_A(I) of program V_A reaches a goal configuration under SC.

Reachability of a goal configuration in one instance of V_A can be reformulated
as a coverability problem for Petri nets, which is known to be decidable [26].
The key observation in the reduction to Petri nets is that threads in instance
programs never use their identifiers, simply because they are copies of the same
source code. This means there is no need to track the identity of threads; it
is sufficient to count how many instances of a thread are in each state, a
technique known as counter abstraction [13].
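The idea behind counter abstraction can be illustrated on the thread-local part of a configuration. The Python sketch below is a toy illustration under assumptions: the functions `abstract` and `covers` and the state names are hypothetical, and shared memory is ignored. It shows that configurations that differ only in which copy is in which state collapse to the same vector of counters, and that a coverability-style goal is a lower bound on some counters.

```python
from collections import Counter


def abstract(local_states):
    """Map a tuple of per-thread local states to a count per state."""
    return Counter(local_states)


def covers(counts, goal):
    """Check whether `counts` has at least `goal[s]` threads in every state s."""
    return all(counts[state] >= needed for state, needed in goal.items())


# Three copies of the same thread: the identities of the copies do not matter.
config_a = ("waiting", "critical", "waiting")
config_b = ("waiting", "waiting", "critical")
same_abstraction = abstract(config_a) == abstract(config_b)
two_waiting = covers(abstract(config_a), {"waiting": 2})
two_critical = covers(abstract(config_a), {"critical": 2})
```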

Theorem 5. Robustness for parameterized programs over finite data domains
is decidable and EXPSPACE-hard, already for Boolean programs.

For the lower bound, we in turn encode the coverability problem for Petri nets 
into robustness for parameterized programs |1I23) 

7 Fence Insertion 

To ease the presentation, we return to parallel programs. Since the algorithm 
only relies on a robustness checker, it carries over to the parametric setting. 

Our goal is to insert a set of fences that ensure robustness of the resulting
program. By inserting a fence at label l we mean the following modification of
the program. Introduce a fresh label l_f. Then, translate each instruction l: inst;
goto l'; into l_f: inst; goto l';. Finally, add an instruction l: mfence; goto l_f;.
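The fence insertion step is a purely syntactic program transformation. A minimal Python sketch follows, assuming a hypothetical representation of a thread as a list of (source label, instruction text, destination label) triples; the function name `insert_fence` and the `_f` suffix for the fresh label are choices made here for illustration.

```python
def insert_fence(instructions, label):
    """Insert a memory fence at `label`.

    Every instruction that starts at `label` is moved to a fresh label, and a
    new instruction `label: mfence; goto fresh` is added.
    """
    fresh = label + "_f"
    # The label must really be fresh in this thread.
    assert all(fresh not in (src, dst) for src, _, dst in instructions)
    result = [
        (fresh, inst, dst) if src == label else (src, inst, dst)
        for src, inst, dst in instructions
    ]
    result.append((label, "mfence", fresh))
    return result


# Thread t1 of Dekker: a store to x followed by a load of y.
t1 = [("l0", "mem[x] <- 1", "l1"), ("l1", "r1 <- mem[y]", "l2")]
fenced_t1 = insert_fence(t1, "l1")
```

After the transformation, control reaching l1 first executes the fence and only then the original load.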

We call a set of labels F in program V a valid fence set if inserting fences
at these labels yields a robust program. We say that F is irreducible if no strict
subset is a valid fence set. In general, however, we would like to compute a valid
fence set which is optimal in some sense. We pose the fence computation problem:

Given: A program V and a strictly positive cost function $C: LAB \to \mathbb{R}^{+}$.
Problem: Compute a valid fence set F with $\sum_{l \in F} C(l)$ minimal.

Since we assume C to be strictly positive, every optimal fence set is irreducible.

We consider two criteria of optimality: minimization of program size and
maximization of program performance. By solving the problem for C ≡ 1 we
compute a fence set of minimal size, thus minimizing the code size of the fenced
program. Maximization of program performance requires minimizing the number
of times memory fence instructions are executed: practical measurements [1]
show that it is impossible to save CPU cycles by executing more fences, but
with fewer stores in the TSO buffer. For this, C(l) is defined as the frequency
at which instructions labeled by l occur in executions of the original program
V. Concrete values of C can be either estimated by profiling or computed by
mathematical reasoning about the program.

From a complexity point of view, fence computation is at least as hard as
robustness. Indeed, robustness holds if and only if the optimal valid fence set is
F = ∅. Actually, since fence sets can be enumerated, computing an optimal valid
fence set does not require more space than checking robustness. Notice that this
also holds in the parameterized case.

Theorem 6. For programs over finite domains, fence computation is PSPACE-complete.
In the parameterized case, it is decidable and EXPSPACE-hard.

In the remainder of the section, we give a practical algorithm for computing 
optimal valid fence sets. 

7.1 Fence Sets for Attacks 

We say that a label l is involved in the attack A = (t_A, stinst, ldinst) if it belongs
to some path in the control flow graph of t_A from the destination label of stinst
to the source label of ldinst. We denote the set of all such labels by L_A.

We call a set of labels F_A an eliminating fence set for attack A if adding fences
at all labels in F_A eliminates the attack. Dekker's algorithm has two eliminating
fence sets: F_A = {l_1} eliminates the only attack by t_1, and F_A' = {l_1'} eliminates
the only attack by t_2. Actually, the sets are irreducible: no strict subset eliminates
the attack. Note that any irreducible eliminating set F_A satisfies F_A ⊆ L_A.
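The set of involved labels can be computed with two graph searches: a label lies on some path from the destination of stinst to the source of ldinst exactly if it is forward reachable from the former and backward reachable from the latter. The Python sketch below assumes a hypothetical control flow graph encoding (a dictionary from a label to the set of its successor labels); the function names are illustrative.

```python
def reachable(edges, start):
    """Return all nodes reachable from `start` (including `start` itself)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def involved_labels(cfg, st_dst, ld_src):
    """Labels on some control flow path from `st_dst` to `ld_src`."""
    reverse = {}
    for src, dsts in cfg.items():
        for dst in dsts:
            reverse.setdefault(dst, set()).add(src)
    return reachable(cfg, st_dst) & reachable(reverse, ld_src)


# Dekker, thread t1: the store goes from l0 to l1, the load starts at l1.
dekker_cfg = {"l0": {"l1"}, "l1": {"l2"}}
dekker_involved = involved_labels(dekker_cfg, st_dst="l1", ld_src="l1")

# A thread with a loop and an exit branch between the store and the load.
loop_cfg = {"l0": {"l1"}, "l1": {"l2", "l4"}, "l2": {"l3"}, "l3": {"l1"}, "l4": set()}
loop_involved = involved_labels(loop_cfg, st_dst="l1", ld_src="l2")
```

In the second example the exit label l4 is not involved, since no path from l4 leads back to the load.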

Lemma 4. Every irreducible valid fence set F can be represented as a union of
irreducible eliminating fence sets for all feasible attacks.

Proof. By Theorem 1, fence set F eliminates all feasible attacks. Therefore, it
includes some irreducible eliminating fence set F_A for every feasible attack A.
By irreducibility, F cannot contain labels outside the union of these sets. □

In compliance with the lemma, in Dekker's program F = F_A ∪ F_A'.

Lemma 4 is useful for fence computation since optimal fence sets are always
irreducible. All irreducible eliminating fence sets for attacks can be constructed
by an exhaustive search through all selections of labels involved in the attack.
For each candidate fence set, to judge whether it eliminates the attack, we check
SC reachability in the instrumented program as described in Section 5.

Note that this search may raise an exponential number of reachability queries. 
In practice this rarely constitutes a problem. First, attacks seldom have a large 
number of involved labels, so the number of candidates is small. Second, the 
reachability checks can be avoided if a candidate fence set covers all the ways in 
the control flow graph from stinst to Idinst. 
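The exhaustive search can be organized by increasing candidate size, so that a candidate is skipped as soon as it contains an already found eliminating set. The Python sketch below is illustrative: the function name and the oracle `eliminates` are hypothetical, where the oracle stands for the SC reachability check on the fenced, instrumented program and is stubbed in the example.

```python
from itertools import combinations


def irreducible_eliminating_sets(involved, eliminates):
    """Enumerate all irreducible eliminating fence sets among `involved` labels.

    `eliminates(candidate)` is an assumed oracle that tells whether fences at
    the labels in `candidate` eliminate the attack.
    """
    found = []
    labels = sorted(involved)
    for size in range(len(labels) + 1):
        for combo in combinations(labels, size):
            candidate = frozenset(combo)
            # Candidates are visited by increasing size, so a reducible
            # candidate always contains a previously found set.
            if any(smaller <= candidate for smaller in found):
                continue
            if eliminates(candidate):
                found.append(candidate)
    return found


# Stub oracle: a fence at l1 suffices, or fences at both l2 and l3.
def stub_eliminates(candidate):
    return "l1" in candidate or {"l2", "l3"} <= candidate


fence_sets = irreducible_eliminating_sets({"l1", "l2", "l3"}, stub_eliminates)
```

The pruning step also saves oracle calls, since supersets of known eliminating sets are never checked.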



7.2 Computing an Optimal Valid Fence Set 

To choose among the sets F_A, we set up a 0/1-integer linear programming (ILP)
problem $M_V \cdot x_V \ge b_V$. The optimal solutions $f(x_V) \to \min$ correspond to
optimal fence sets. Here, 0/1 means the variables are restricted to yield Booleans.

We define inequalities that encode the feasible attacks with their corrections.
Consider attack A for which we have determined the irreducible eliminating
fence sets $F_1, \ldots, F_n$. For each set, we introduce a variable $x_{F_i}$ and set up
Inequality (14) (left). It selects a fence set to eliminate the attack.

$$\sum_{1 \le i \le n} x_{F_i} \ge 1, \qquad \sum_{l \in F_i} x_l \ge |F_i| \, x_{F_i}, \qquad f(x_V) := \sum_{l \in LAB} C(l) \, x_l. \tag{14}$$

When $F_i$ has been chosen, we insert a fence at each of its labels l. We add further
variables $x_l$, and encode this insertion by Inequality (14) (center). By definition
of the ILP, the variables $x_{F_i}$ and $x_l$ will only take Boolean values 0 or 1. So if
$x_{F_i}$ is set to 1, the inequality requires that all $x_l$ with $l \in F_i$ are set to 1.

Our goal is to select fences with minimal costs. We encode this into the
objective function (14) (right). An optimal solution $x^*$ of the resulting 0/1-ILP
denotes the fence set $F(x^*) := \{ l \in LAB \mid x^*_l = 1 \}$.

Theorem 7. $F(x^*)$ is valid and optimal, and thus solves fence computation.

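For very small inputs, the optimum of the 0/1-ILP can be cross-checked without an ILP solver. The Python sketch below deliberately swaps the ILP for plain exhaustive enumeration: it picks one irreducible eliminating fence set per feasible attack and minimizes the cost of the union. With strictly positive costs, selecting more than one set per attack never lowers the cost, so the enumeration reaches the same optimum. The function name and the data layout are assumptions for illustration.

```python
from itertools import product


def optimal_fence_set(sets_per_attack, cost):
    """Choose one eliminating fence set per attack with minimal total cost.

    `sets_per_attack` is a list with one entry per feasible attack; each entry
    lists that attack's irreducible eliminating fence sets. `cost` maps a
    label to its strictly positive cost C(l).
    """
    best, best_cost = None, float("inf")
    for choice in product(*sets_per_attack):
        fences = frozenset().union(*choice)
        total = sum(cost[label] for label in fences)
        if total < best_cost:
            best, best_cost = fences, total
    return best, best_cost


# Two attacks; the cheap label l2 is shared between their corrections.
sets_per_attack = [
    [frozenset({"l1"}), frozenset({"l2", "l3"})],  # corrections for attack A
    [frozenset({"l2"})],                           # corrections for attack B
]
cost = {"l1": 5, "l2": 1, "l3": 1}
best_set, best_cost = optimal_fence_set(sets_per_attack, cost)
```

The number of combinations grows exponentially with the number of attacks, so this is only meant as a reference for tiny examples.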
8 Experimental Evaluation 

We implemented our algorithms in a prototype called Trencher [1]. The tool
performs the reduction of robustness to SC reachability given in Section 5 and
computes a minimal fence set as described in Section 7. Trencher executes
independent reachability queries in parallel and uses Spin [16] as back-end model
checker. With Trencher, we have performed a series of experiments.

8.1 Examples 

The first class of examples are mutual exclusion protocols that are implemented
via shared variables. These protocols are typically not robust under TSO and
require additional fences after stores to synchronization variables. We studied
robust and non-robust instances of Dekker and Peterson for two threads, as well
as Lamport's fast mutex [22] for three threads. Moreover, we checked the CLH
and MCS Locks, robust list-based queue locks that use compare-and-set [15].

As a second class of examples, we considered concurrent data structures.
The Lock-Free Stack is a concurrent stack implementation using
compare-and-swap [15]. Cilk's THE WSQ is a work stealing queue from the implementation
of the Cilk-5 programming language [12].

Finally, we consider miscellaneous concurrent algorithms that are known to
be sensitive to program order relaxations. We analyse several instances of the
Non-Blocking Write protocol [17]. NBWL is the spinlock + non-blocking write
example considered by Owens in Section 8 of [24]. Finally, our tool discovers the
known bug in Java's Parker implementation that is due to TSO relaxations [11].

The test inputs are available online [1].

[Table 2: the table body is not recoverable. Surviving row labels: Peterson
(non-robust), Peterson (robust), Dekker (non-robust), Dekker (robust), Lamport
(non-robust), CLH Lock (robust), MCS Lock (robust), NBWL (robust), Parker (robust).]

Table 2: Benchmarking results.



8.2 Benchmarking 

We executed Trencher on the examples, using a machine with Intel(R)
Core(TM) i5 CPU M 560 @ 2.67GHz (4 cores) running GNU/Linux. Table 2
summarizes the results. The columns T, L, and I give the number of threads,
labels, and instructions in the example. RQ is the number of reachability queries
raised by Trencher. Provided the example is robust, this number is equal to
the number of attacks (t_A, stinst, ldinst). NR1 is the number of verification queries
that were answered negatively by Trencher itself, without running Spin. Such
queries correspond to attacks where stinst cannot be delayed past ldinst because
of memory fences or locked instructions in between. NR2 and R are the numbers
of queries that are answered negatively/positively by the external model checker.
Hence, RQ = NR1 + NR2 + R. F is the number of fences inserted.

The column Spin gives the total CPU time taken by Spin and Clang, the C 
compiler, to produce a verifier executable (pan). The column Ver provides the 
total CPU time taken by Trencher and the external verifier. Real is the wall- 
clock time in seconds of processing an example. All times are given in seconds. 

8.3 Discussion 

The analysis of robust algorithms is particularly fast. They typically only have 
a small number of attacks that have to be checked by a model checker. Robust 
Dekker and Peterson do not have such attacks at all. In the CLH and MCS locks, 
their number is less than 20%. 

In some examples (non-robust Dekker, CLH Lock, NBW2, NBW4), up to
94% of the CPU time was spent on generating verifiers. This leaves room for
improvement by switching to a model checker without a compilation phase. For
some examples (LamNR, CLHLock), the wall-clock time constitutes 1/3 to 1/4
of the CPU time (4 cores). This confirms good parallelizability of the approach.

Remarkably, our trace-based analysis can establish robustness of the NBWL
example, as opposed to the earlier analyses via triangular data races which would
have to place a fence [24].

A Definition of Traces



Since the definition of traces in Section 3 was a bit brief, we recall here the full
construction. Consider an SC or a TSO computation $\sigma \in C_{SC}(V) \cup C_{TSO}(V)$.
Its trace $Tr(\sigma)$ is a node-labelled graph $(N, \lambda, \to_{po}, \to_{st}, \to_{src})$ with nodes $N$,
labelling $\lambda : N \to ACT$, and $\to_{po}, \to_{st}, \to_{src} \subseteq N \times N$ the aforementioned relations
that define the edges. The program order is a union $\to_{po} = \bigcup_{t \in THRD} \to^{t}_{po}$ of per
thread total orders. The store ordering $\to_{st} = \bigcup_{a \in DOM} \to^{a}_{st}$ gives a total order for
the stores to each address. We use the syntax $max(\to^{t}_{po})$ and $max(\to^{a}_{st})$ to access
the maximal elements in these total orders. Finally, we have a source relation
$\to_{src}$ from stores to loads.

Traces are defined inductively, starting from the empty trace for the empty
word $\varepsilon$. Assume we already constructed $Tr(\sigma) = (N, \lambda, \to_{po}, \to_{st}, \to_{src})$. In the
definition of $Tr(\sigma \cdot act) := (N \cup \{n\}, \lambda', \to'_{po}, \to'_{st}, \to'_{src})$, the choice of n depends
on the type of act. If we have a store, we use the moment the action was issued.
Otherwise, we add a new node:

act = (t, st, a, v): Let n be the minimal node in $\to^{t}_{po}$ labelled by $\lambda(n) = isu$. We
    set $\lambda' := \lambda[n := act]$ and $\to'_{po} := \to_{po}$.
act ≠ (t, st, a, v): We add a fresh node $n \notin N$ to the trace, set $\lambda' := \lambda \cup \{(n, act)\}$,
    and $\to'_{po} := \to_{po} \cup \{(max(\to^{t}_{po}), n)\}$.

The store order is updated only for stores (t, st, a, v). We define
$\to'_{st} := \to_{st} \cup \{(max(\to^{a}_{st}), n)\}$. The relation is not changed otherwise. The source relation
is updated only for loads and stores. In case of a load (t, ld, a, v) we set
$\to'_{src} := \to_{src} \cup \{(max(\to^{a}_{st}), n)\}$. In case of a store (t, st, a, v) we update the source
relation for loads that read from the store early: for all nodes m with $n \to^{+}_{po} m$
and $\lambda(m) = (t, ld, a, v)$ we set $\to'_{src} := (\to_{src} \setminus \{(*, m)\}) \cup \{(n, m)\}$.

Consider trace $Tr(\sigma)$ with $\sigma \in C_{TSO}(V)$. The conflict relation $\to_{cf}$ from load to
store actions makes cyclic accesses in the trace visible. We define $ld \to_{cf} st$ if
there is another store action st' in $Tr(\sigma)$ that satisfies $st' \to_{src} ld$ and $st' \to_{st} st$.
If ld reads the initial value of an address and st overwrites it, we also have
$ld \to_{cf} st$. The happens-before relation of a trace is a union of all four relations:
$\to_{hb} := \to_{po} \cup \to_{st} \cup \to_{src} \cup \to_{cf}$.
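The inductive construction can be prototyped directly. The following Python sketch is a simplified illustration, not the paper's formal definition: actions are tuples (t, "isu"), (t, "st", a, v), or (t, "ld", a, v), nodes are list indices, and the function names `build_trace` and `has_cycle` are hypothetical. It builds the four relations and tests the happens-before relation for a cycle on a Dekker-style TSO computation and on an SC counterpart.

```python
def build_trace(computation):
    """Return (labels, hb) where hb is the set of happens-before edges."""
    labels, po, st, src = [], set(), set(), {}
    last_po, last_st, first_st, issued = {}, {}, {}, {}
    for act in computation:
        thread, kind = act[0], act[1]
        if kind == "st":
            addr = act[2]
            n = issued[thread].pop(0)          # minimal isu node of the thread
            labels[n] = act
            if addr in last_st:
                st.add((last_st[addr], n))
            else:
                first_st[addr] = n
            last_st[addr] = n
            # Loads of the same thread that read this store early.
            for m in range(n + 1, len(labels)):
                lab = labels[m]
                if lab[0] == thread and lab[1] == "ld" and lab[2] == addr:
                    src[m] = n
        else:
            n = len(labels)                    # fresh node
            labels.append(act)
            if thread in last_po:
                po.add((last_po[thread], n))
            last_po[thread] = n
            if kind == "isu":
                issued.setdefault(thread, []).append(n)
            elif kind == "ld":
                src[n] = last_st.get(act[2])   # None encodes the initial value
    successor = dict(st)                       # next store in the store order
    cf = set()
    for m, s in src.items():
        nxt = first_st.get(labels[m][2]) if s is None else successor.get(s)
        if nxt is not None:
            cf.add((m, nxt))
    hb = po | st | {(s, m) for m, s in src.items() if s is not None} | cf
    return labels, hb


def has_cycle(num_nodes, edges):
    """Depth-first search for a cycle in a directed graph on 0..num_nodes-1."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
    state = [0] * num_nodes                    # 0 unvisited, 1 on stack, 2 done

    def visit(n):
        state[n] = 1
        for m in succ.get(n, ()):
            if state[m] == 1 or (state[m] == 0 and visit(m)):
                return True
        state[n] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in range(num_nodes))


tso = [("t1", "isu"), ("t1", "ld", "y", 0), ("t2", "isu"), ("t2", "st", "y", 1),
       ("t2", "ld", "x", 0), ("t1", "st", "x", 1)]
sc = [("t1", "isu"), ("t1", "st", "x", 1), ("t1", "ld", "y", 0),
      ("t2", "isu"), ("t2", "st", "y", 1), ("t2", "ld", "x", 1)]
tso_labels, tso_hb = build_trace(tso)
sc_labels, sc_hb = build_trace(sc)
tso_cyclic = has_cycle(len(tso_labels), tso_hb)
sc_cyclic = has_cycle(len(sc_labels), sc_hb)
```

In the TSO computation the store of t1 is delayed past its load, and the resulting happens-before relation is cyclic; in the SC computation it is acyclic.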

B Minimal Violations and Locality 

In our earlier work [8] we showed that in a minimal violation only one thread
reorders its actions. Since we employ here a more elaborate programming model,
this locality result has to be checked again.

Consider a computation $\tau = \alpha \cdot a \cdot \beta \cdot b \cdot \gamma \in C_{TSO}(V)$ with two actions a and
b of the same thread thread(a) = t = thread(b). We define the distance $d_\tau(a, b)$
between a and b in $\tau$ as the number of actions in $\beta$ that also belong to this
thread: $d_\tau(a, b) := |\beta{\downarrow}t|_{len}$. The number of delays $\#(\tau)$ in computation $\tau$ is the
sum of distances between corresponding issue and store actions:

$$\#(\tau) := \sum_{\text{corr. } isu,\, st \text{ in } \tau} d_\tau(isu, st).$$

We call a violating computation $\tau$ a minimal violation if it has a minimal
number of delays among all violating computations. Clearly, a program V has
violating computations if and only if it has a minimal violation.

The following lemma says that if a store action has been delayed, then it has
been delayed past a load action of the same thread. Moreover, the load did not
read the value of this store action early.

Lemma 5. Consider a minimal violation $\tau = \alpha \cdot isu \cdot \beta \cdot st \cdot \gamma \in C_{TSO}(V)$, where
isu and st stem from the same instruction instance of thread t. Then $\beta{\downarrow}t$ is either
empty, or $\beta{\downarrow}t = \beta' \cdot ld \cdot \beta''$ where ld is a load action with addr(ld) ≠ addr(st)
and $\beta''$ contains only store actions.

Proof. Suppose $\beta$ contains one or more actions of thread t. If all actions of thread
t in $\beta$ are stores, then also $\tau' = \alpha \cdot \beta \cdot isu \cdot st \cdot \gamma$ is a TSO computation of V. It
has the same trace as $\tau$ but $\#(\tau') < \#(\tau)$, which contradicts minimality of $\tau$.

Otherwise let a be the last non-store action in $\beta{\downarrow}t$, i.e., $\beta = \beta_1 \cdot a \cdot \beta_2$ and
all actions in $\beta_2$ are stores or belong to threads different from t. Since store
actions cannot be delayed past a memory fence of the same thread, a is an issue
action, a local action, or a load. In the former two cases, as well as if a is a load
from addr(a) = addr(st), delaying st past a can be avoided in the computation
$\tau' = \alpha \cdot isu \cdot \beta_1 \cdot \beta_2 \cdot st \cdot a \cdot \gamma$ of V. It has the same trace as $\tau$ and $\#(\tau') < \#(\tau)$,
which contradicts minimality of $\tau$. □

In the remainder of the section, we develop a method to detect happens-before
relations in a trace with the help of embedded computations. We relate two
actions in a computation iff the corresponding nodes in the trace are related. To
avoid case distinctions for issue and store actions that yield the same node in
the trace, we introduce the issue relation $\to_{isu}$ that links them: $isu \to_{isu} st$. We
include $\to_{isu}$ into $\to_{hb}$.

Definition 2 ([8]). Let $\tau = \alpha \cdot a \cdot \beta \cdot b \cdot \gamma \in C_{TSO}(V)$. We say a happens-before b
through $\beta$ if there is a (potentially empty) subsequence $c_1 \ldots c_n$ of $\beta$ that satisfies
(assuming $c_0 := a$ and $c_{n+1} := b$):

$c_i \to_{hb} c_{i+1}$ or $c_i \to^{+}_{po} c_{i+1}$ for all $i \in [0, n]$.

The next lemma states that the just defined relation is stable under insertion.

Lemma 6 ([8]). Consider computations $\tau = \alpha \cdot a \cdot \beta \cdot b \cdot \gamma$ and $\tau' = \alpha \cdot a \cdot \beta' \cdot b \cdot \gamma$
in $C_{TSO}(V)$ so that $\tau{\downarrow}t = \tau'{\downarrow}t$ for every thread t. Let $\beta$ be a subsequence of $\beta'$.
Then if $a \to^{+}_{hb} b$ through $\beta$ then $a \to^{+}_{hb} b$ through $\beta'$.

The following lemma says that if two actions in a minimal violation are not
related via $\to^{+}_{hb}$, they can be reordered without changing the trace and the order
of actions within each thread.

Lemma 7 ([8]). Consider a minimal violation $\tau = \alpha \cdot a \cdot \beta \cdot b \cdot \gamma \in C_{TSO}(V)$.
Then (1) $a \to^{+}_{hb} b$ through $\beta$ or (2) there is $\tau' = \alpha \cdot \beta_1 \cdot b \cdot a \cdot \beta_2 \cdot \gamma \in C_{TSO}(V)$
so that $Tr(\tau) = Tr(\tau')$ and $\tau{\downarrow}t = \tau'{\downarrow}t$ for every thread t.



Proof. We establish ¬(1) ⇒ (2). Note that this proves the disjunction since
¬(2) ⇒ (1) is the contrapositive. We proceed by induction on $|\beta|_{len}$ and slightly
strengthen the hypothesis: we also show that $\beta_2$ is a subsequence of $\beta$.

Base case: $|\beta|_{len} = 0$. Then $\tau = \alpha \cdot a \cdot b \cdot \gamma$ and $a \not\to_{hb} b$. If thread(a) = thread(b),
then $b \to^{+}_{po} a$. Therefore, b is a store action which has been delayed past a.
Swapping a and b will save the delay without changing the trace, in contradiction
to the minimality of $\tau$.

If thread(a) ≠ thread(b), then either at least one of the two actions is local,
the actions access different addresses, or both are loads. In all cases swapping
them produces $\tau'$ as required in the statement of the lemma.

Step case: Assume the statement holds for $|\beta|_{len} \le n$. Consider $\tau = \alpha \cdot a \cdot \beta \cdot b \cdot \gamma$
with $|\beta|_{len} = n + 1$. Let c be the last action in $\beta = \beta' \cdot c$. Since $a \not\to^{+}_{hb} b$ through
$\beta$, then $a \not\to^{+}_{hb} c$ through $\beta'$ or $c \not\to_{hb} b$.

Let $a \not\to^{+}_{hb} c$. We apply the induction hypothesis to $\tau$ with respect to a and c.
This gives $\tau' = \alpha \cdot \beta'_1 \cdot c \cdot a \cdot \beta'_2 \cdot b \cdot \gamma$ with the same trace and thread computations
as $\tau$. Then, taking into account Lemma 5, we apply the hypothesis to $\tau'$ with
respect to a and b. This yields $\tau'' = \alpha \cdot \beta'_1 \cdot c \cdot \beta'_{21} \cdot b \cdot a \cdot \beta'_{22} \cdot \gamma$ having the same
trace and thread computations as $\tau'$ and $\tau$. Note that $\beta'_{22}$ is a subsequence of
$\beta'_2$, which in turn is a subsequence of $\beta'$ and hence of $\beta$.

Let $c \not\to_{hb} b$. We apply the induction hypothesis to $\tau$ with respect to b and c,
getting $\tau' = \alpha \cdot a \cdot \beta' \cdot b \cdot c \cdot \gamma$ with the same trace and thread computations as $\tau$.
Applying it again to $\tau'$ with respect to a and b gives $\tau'' = \alpha \cdot \beta'_1 \cdot b \cdot a \cdot \beta'_2 \cdot c \cdot \gamma$.
The computation has the same trace and thread computations as $\tau'$ and $\tau$. Since
$\beta'_2$ is a subsequence of $\beta'$, $\beta'_2 \cdot c$ is a subsequence of $\beta$. □

Lemma 8 (Locality [8]). In a minimal violation, only a single thread delays
stores.

Proof. Consider a minimal violation $\tau \in C_{TSO}(\mathcal{P})$ and suppose at least two
threads delayed stores. By Lemma El each store was delayed past a load of the 
same thread. Let $st_2$ of thread $t_2$ be the overall last delayed store in $\tau$, and let
$ld_2$ be the last load of $t_2$ overstepped by $st_2$. Similarly, let $st_1$ be the overall last
delayed store in a thread $t_1 \neq t_2$. Let $ld_1$ be the last load overstepped by $st_1$.
The following fundamental mutual dispositions of reorderings are possible: 

1. t = 71 • isui • 72 • Id 1 • 73 • sti • 74 • isu 2 • 75 • ld 2 • 76 • st 2 • 77 

2. t = 71 • isui • 7 2 • Id 1 • 73 • isu 2 • 74 • ld 2 • 75 • st 2 • -y e ■ sti • 77 

3. r = 71 • isui • 7 2 • Id 1 • 73 • isu 2 • 74 • ld 2 • 75 • sti • 76 ■ st 2 • 77 

In these three computations every pair $(ld_i, st_i)$ provides a happens-before cycle:
$st_i \to^{+}_{po} ld_i$ and, by Lemma 7 and minimality, $ld_i \to^{+}_{hb} st_i$ through the appropriate
subrange of $\tau$.

In the first disposition $\tau$ is not minimal, since it can be shortened to the
violating computation $\tau' = \gamma_1 \cdot isu_1 \cdot \gamma_2 \cdot ld_1 \cdot \gamma_3 \cdot st_1 \cdot \beta$ with $\#(\tau') < \#(\tau)$. Here,
$\beta$ contains only store actions of $t_2$ that complete earlier issue actions.

In the second disposition $\tau$ is not minimal either. Starting from $ld_1$, thread
$t_1$ does not perform any actions, except delayed stores, until $st_1$ (Lemma 5).
Therefore, $ld_1$ and all program order later actions of $t_1$ can be safely removed
from $\tau$ without affecting the happens-before cycle produced by $t_2$. The resulting
computation has a smaller number of delays (due to the removed $ld_1$), but its
trace still includes the cycle by $t_2$. A contradiction to minimality of $\tau$.

Lastly, in the third case $\tau$ is also not minimal. First we delete $\gamma_7$. Then we
erase all actions from $\gamma_6$ that do not belong to $t_2$: $\gamma'_6 = \gamma_6 \downarrow t_2$. By construction,
the resulting computation $\tau'$ is a feasible TSO computation:

$\tau' = \gamma_1 \cdot isu_1 \cdot \gamma_2 \cdot ld_1 \cdot \gamma_3 \cdot isu_2 \cdot \gamma_4 \cdot ld_2 \cdot \gamma_5 \cdot st_1 \cdot \gamma'_6 \cdot st_2$.

Computation $\tau'$ still contains the happens-before cycle $st_1 \to^{+}_{po} ld_1 \to^{+}_{hb} st_1$
inherited from $\tau$. Since deleting actions cannot increase the number of delays,
$\#(\tau') = \#(\tau)$. Moreover, since $\tau$ is a minimal violation, so is $\tau'$.

By Lemma 3, $ld_2 \to^{+}_{hb} st_2$ through $\gamma_5 \cdot st_1 \cdot \gamma'_6$. By the choice of $ld_1$ and $st_1$
and in accordance with Lemma 5, $(\gamma_3 \cdot isu_2 \cdot \gamma_4 \cdot ld_2 \cdot \gamma_5) \downarrow t_1$ only contains delayed
stores that were issued before $ld_1$. By definition, $\gamma'_6$ does not contain actions of
$t_1$ at all. Therefore, $ld_1$ is the program order last action of $t_1$. It can be safely
removed from $\tau'$ without affecting the cycle of $t_2$. The resulting computation is

$\tau'' = \gamma_1 \cdot isu_1 \cdot \gamma_2 \cdot \gamma_3 \cdot isu_2 \cdot \gamma_4 \cdot ld_2 \cdot \gamma_5 \cdot st_1 \cdot \gamma'_6 \cdot st_2$.

Note that $\#(\tau'') < \#(\tau') = \#(\tau)$, but computation $\tau''$ still contains the cycle
$st_2 \to^{+}_{po} ld_2 \to^{+}_{hb} st_2$. A contradiction to minimality of $\tau$. □

C Soundness and Completeness of the Instrumentation 

Theorem 8 (Soundness and Completeness). Attack $A = (t_A, stinst, ldinst)$
is feasible in program $\mathcal{P}$ iff program $\mathcal{P}_A$ reaches a goal configuration under SC.

Proof. Soundness. Suppose the instrumented program reaches a goal config- 
uration. For simplicity, assume that it immediately stops after this. Then the 
computation of the instrumented program looks like this: 

$\tau_A = \tau_1 \cdot isu_{st_A} \cdot st_A^{aux} \cdot \tau_2 \cdot ld_A \cdot \tau_3 \cdot isu_{suc} \cdot st_{suc}$.

The last action, $st_{suc}$, is performed by a helper and sets variable suc to true,
as required by the definition of goal configurations. This action originates from
an instruction generated in accordance with (13). To reach this instruction, the
helper has to enter its code copy.

As required by (J5J and ©, for the first helper to enter its code copy, the
attacker must set a hb-variable to a non-zero value by executing ldinst (action
$ld_A$) instrumented in accordance with (2). For this, the attacker must enter its
code copy and start performing stores to auxiliary addresses. Accordingly, the
first attacker's store to an auxiliary address is denoted by $st_A^{aux}$ in $\tau_A$. It stems
from the instrumented stinst (1) and is located before $ld_A$.

We elaborate on the contents of $\tau_1$, $\tau_2$, and $\tau_3$. First, the attacker and helpers
execute the code of the original program (the helpers with an additional check
at every instruction, (7)). In $\tau_2$ the helpers continue to execute the code of the
original program. Shortly before performing $ld_A$ and stopping, the attacker sets
variable hb to true, thus forcing the helpers to enter their code copies. Therefore
all actions in $\tau_3$ belong to helpers that have entered their code copies. Also,
$\tau_2$ only contains stores of the attacker to auxiliary addresses, and $\tau_3$ does not
contain attacker actions at all, as follows from (2).

We now turn $\tau_A$ into the following TSO witness computation:

$\tau = \tau'_1 \cdot isu_{st_A} \cdot \tau'_2 \cdot ld_A \cdot \tau'_3 \cdot st_A \cdot \tau'_4$.

Here, $\tau'_1$ is the subsequence of all $\tau_1$ actions that are produced by instructions
from $\mathcal{P}$ (this is $\tau_1$ without the conditionals introduced in (7)). Computation $\tau'_2$
is the subsequence of all actions of $\tau_2$ produced by instructions from $\mathcal{P}$ and by
their clones in the code copy of the attacker, except the store actions to auxiliary
addresses. These store actions constitute $\tau'_4$. Finally, $\tau'_3$ is the subsequence of all
helper actions of $\tau_3$ produced by clones of instructions from $\mathcal{P}$. We also strip the
suffix d from the addresses of load and store actions in $\tau'_2$ and $\tau'_4$.

That $\tau$ is a computation of program $\mathcal{P}$ follows from the fact that $\tau_A$ is
executable. We just removed actions produced by the instrumentation and replaced
buffering by delaying of store actions; we did not change any data dependencies.
The delaying of $st_A \cdot \tau'_4$ past $ld_A$ is possible because the attacker did not execute
memory fences between $st_A^{aux}$ and $ld_A$, as guaranteed by ©.

Let us check that $\tau$ is a TSO witness (Figure 4). (W1) holds as in $\tau$ indeed
only the attacker delays stores. The first delayed store $st_A$ is an instance of
stinst, load $ld_A$ is an instance of ldinst and is the last action of the attacker that
is overstepped by delayed stores, so (W2) holds. For each act in $ld_A \cdot \tau'_3 \cdot st_A$ it
holds that $ld_A \to^{*}_{hb} act$: (W3). This is by construction of helpers in accordance with
Lemma 3. Computation $\tau'_4$ consists only of the stores delayed by the attacker,
(W4). (W5) holds due to the check in ©. So, $\tau$ is a TSO witness for attack $A$.

Completeness. Suppose there is a TSO witness $\tau$ for attack $A$ as in Figure 4:

$\tau = \tau_1 \cdot isu_{st_A} \cdot \tau_2 \cdot ld_A \cdot \tau_3 \cdot st_A \cdot \tau_4$.

We show that the instrumented program has an execution that leads to a goal
state. In the beginning, the instrumented attacker and helper threads execute
instructions of the original program, namely those in $\tau_1$. The helpers actually
execute these actions instrumented by (7), i.e., with an additional assert. These
conditionals are executable because the attacker did not yet set variable hb.

Then the attacker executes $[\![stinst]\!]_{A1}$ (stinst is the instruction that produced
$isu_{st_A}$ in $\tau$) and enters the code copy. Now all its stores will be executed on
auxiliary addresses, as defined in (TTJ) and (3). This means they stay invisible
to the helper threads, as they were in the computation of the original program.
Also, the instrumentation of loads (4) makes sure that they read buffered values,
if they exist. Altogether this preserves the data dependencies from the original
computation.

So the attacker executes the instructions that lie in $\tau_2$, instrumented by
$[\![-]\!]_{A2}$. Note that $\tau_2$ does not contain memory fences, otherwise $st_A$ could not have
been delayed past $ld_A$ in $\tau$. Therefore, © cannot provoke a block of the attacker.
The helpers still execute the actions of the original program, instrumented by
(7). Finally, the attacker executes ldinst which produced $ld_A$ in $\tau$, Equation (2).
This is possible due to (W5).

All actions in $\tau_3$ belong to helpers. By (W3), they are in happens-before
relation with $ld_A$. Therefore, due to the instrumentation based on Lemma 3,
the helpers are able to enter their code copies, (JSJ and ©, and execute the
instructions that produced $\tau_3$. Note that the instrumentation of the code copy
for helpers does not introduce any conditionals that could block the execution.

At least one of the helper's actions in $\tau_3$ performs a load or a store to the
address used in $st_A$. Otherwise, (W3) would not hold ($ld_A$ and the delayed write
of the attacker use different addresses by (W5)). When performing the action
in the instrumented program, the helper will set the hb-variable for the address
used in $st_A$ to a non-zero value, Equations (fTTj) and (IT^I). Therefore, at the next
step the helper will be able to set suc to true in accordance with (13) and make
the instrumented program reach a goal state. □

D Decidability and Complexity 

The reductions of robustness to reachability and parameterized reachability are 
independent of the number of addresses and the structure of the data domain. 
Hence, without further assumptions the resulting reachability queries cannot 
be guaranteed to be decidable. We now discuss conditions on address space and 
data domain that render robustness decidable. Note that we only have to restrict 
these two dimensions. The instrumentation copes with the unbounded size store 
buffers. Moreover, we choose the verification technology so that it handles the 
unbounded number of threads required in parameterized reachability. 

Parallel Programs with Finite Domains Consider a parallel program over 
a finite data domain, and hence finite address space. In this setting robustness is 
PSPACE-complete [8]. Our earlier proof is of complexity-theoretic nature: based 
on enumeration and not meant to be implemented. The instrumentation in this 
paper yields an alternative proof of membership in PSPACE that is conceptually
simpler and allows us to reuse all techniques that have been developed for finite 
state verification. 

Theorem 9. Robustness for parallel programs over finite domains is PSPACE- 
complete. 

Parameterized Programs with Finite Domains Consider parameterized 
programs over finite domains. In this setting, decidability of robustness was 
open (our techniques from [8] do not carry over). With Theorem 4 we can now
solve the problem and establish decidability. The key observation is that threads 
in instance programs never use their identifiers, simply because they are copies 
of the same source code. This means there is no need to track the identity of 
threads, it is sufficient to count how many instances of a thread are in each 
state, a technique known as counter abstraction [13]. Using this technique,
we can reformulate the reachability problem for parameterized programs as a 
coverability problem for Petri nets. We briefly recall the basics on Petri nets. 



Definitions A Petri net is a triple $N = (S, T, W)$ where $S$ is a finite set of places,
$T$ is a finite set of transitions with $S \cap T = \emptyset$, and $W: (S \times T) \cup (T \times S) \to \mathbb{N}$
is a weight function. A marking is a function that assigns a natural number
to each place: $M: S \to \mathbb{N}$. A marked Petri net is a pair $(N, M_0)$ of a Petri
net and an initial marking $M_0$. A transition $t \in T$ is enabled in marking $M$ if
$M(s) \ge W(s,t)$ for all $s \in S$. The firing relation $[\rangle \subseteq \mathbb{N}^{|S|} \times T \times \mathbb{N}^{|S|}$ contains
a tuple $(M_1, t, M_2)$ if transition $t$ is enabled in $M_1$ and for all $s \in S$ we have
$M_2(s) = M_1(s) - W(s,t) + W(t,s)$. We also write $M_1[t\rangle M_2$. We extend the firing
relation to sequences of transitions.

We say that a marking $M$ is reachable in a marked Petri net $(N, M_0)$ if there
is a transition sequence $\sigma \in T^*$ such that $M_0[\sigma\rangle M$. A marking $M$ is coverable
if there is a reachable marking $M'$ so that $M'(s) \ge M(s)$ for all $s \in S$.
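
The definitions above can be made concrete with a small sketch. The following Python fragment is only an illustration under our own assumptions: the dictionary encoding of $W$, the function names, and the two-place example net are hypothetical and do not stem from the construction in this paper.

```python
# Illustrative encoding (an assumption, not the paper's): W maps arcs
# (place, transition) or (transition, place) to weights; absent arcs weigh 0.
# A marking is a dict from every place to a natural number.

def enabled(places, W, M, t):
    """t is enabled in M if M(s) >= W(s, t) for every place s."""
    return all(M[s] >= W.get((s, t), 0) for s in places)

def fire(places, W, M, t):
    """Return M2 with M2(s) = M(s) - W(s, t) + W(t, s), i.e. M [t> M2."""
    if not enabled(places, W, M, t):
        raise ValueError(f"transition {t!r} is not enabled")
    return {s: M[s] - W.get((s, t), 0) + W.get((t, s), 0) for s in places}

def covers(places, M_reached, M):
    """M_reached covers M if it is at least M on every place."""
    return all(M_reached[s] >= M[s] for s in places)

# Hypothetical net: transition "t" moves one token from place "p" to place "q".
places = ["p", "q"]
W = {("p", "t"): 1, ("t", "q"): 1}
M0 = {"p": 2, "q": 0}
M1 = fire(places, W, M0, "t")
M2 = fire(places, W, M1, "t")
```

In this toy net the marking that puts one token on "q" is coverable from `M0`, since `M1` is reachable and covers it.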

Lemma 9 ([26]). The problem to determine whether a marking $M$ is coverable
in a marked Petri net $(N, M_0)$ is decidable.
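
To illustrate why coverability can be decided at all, the following Python sketch implements the standard backward search over upward-closed sets of markings. This is not the argument of [26]; it is a different, well-known procedure that we include only as an illustration. The matrix encoding (`pre[t][s]` for $W(s,t)$, `post[t][s]` for $W(t,s)$) and the example net are assumptions of ours.

```python
def leq(a, b):
    """Componentwise order on markings represented as tuples."""
    return all(x <= y for x, y in zip(a, b))

def coverable(pre, post, m0, target):
    """Decide whether `target` is coverable from m0.

    `basis` holds minimal markings from which `target` can be covered.
    The search terminates because the added markings are pairwise
    incomparable in the relevant direction (Dickson's lemma).
    """
    basis = [tuple(target)]
    frontier = [tuple(target)]
    while frontier:
        m = frontier.pop()
        for t in range(len(pre)):
            # Least marking that enables t and reaches something >= m.
            pred = tuple(pre[t][s] + max(0, m[s] - post[t][s])
                         for s in range(len(m)))
            if any(leq(b, pred) for b in basis):
                continue  # pred is already represented by the basis
            basis = [b for b in basis if not leq(pred, b)] + [pred]
            frontier.append(pred)
    return any(leq(b, tuple(m0)) for b in basis)

# Hypothetical net with places (p, q) and one transition moving p -> q.
pre = [[1, 0]]
post = [[0, 1]]
two_on_q = coverable(pre, post, (2, 0), (0, 2))
three_on_q = coverable(pre, post, (2, 0), (0, 3))
```

With two tokens on p, two tokens on q can be covered but three cannot, so `two_on_q` is true and `three_on_q` is false.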

Reduction of parameterized reachability to Petri net coverability Let
$\mathcal{P}$ be a parameterized program with finite data domain DOM. We define a Petri
net $N = (S, T, W)$ simulating the program.

For each pair of address and value $(a, v) \in DOM \times DOM$ we create a place $s_{a,v}$.
These places represent the state of the global memory: $M(s_{a,v}) = 1$ corresponds
to val(a) = v.

For each thread $t_i$ that declares registers $\vec{r}_i$ and has labels $l_i$ we create places
$s_{l,\vec{v}}$ for all $l \in l_i$ and all $\vec{v} \in DOM^{|\vec{r}_i|}$. These places encode the number of thread
instances in the given control state that have the given register valuation.

For each thread $t_i$ we create a transition $t_i$. Let $l_{0,i}$ be the initial label of $t_i$,
$\vec{r}_i$ be the registers declared by $t_i$, and $\vec{v}_{0,i}$ be a zero vector of length $|\vec{r}_i|$. Then
we set $W(t_i, s_{l_{0,i},\vec{v}_{0,i}}) = 1$. Transition $t_i$ effectively spawns an arbitrary number
of copies of thread $t_i$ that are all in the initial state.

Next we create transitions that simulate the instructions in each thread.
We explain the construction for load instructions. The other instructions are
handled along similar lines. Consider thread $t_i$ with registers $\vec{r}_i$ and a labelled
load instruction $linst = l_1: r \leftarrow mem[f_a(\vec{r}_i)];\ goto\ l_2;$. For each value $v \in DOM$
and for each vector $\vec{v}_{reg} \in DOM^{|\vec{r}_i|}$ we create a transition $t = t_{linst,v,\vec{v}_{reg}}$. We set
$W(s_{l_1,\vec{v}_{reg}}, t) = W(t, s_{l_2,\vec{v}'_{reg}}) = 1$ where $\vec{v}'_{reg} = \vec{v}_{reg}[r := v]$. Let $a = f_a(\vec{v}_{reg})$.
Then we set $W(s_{a,v}, t) = W(t, s_{a,v}) = 1$. Transition $t$ is enabled if there is an
instance of the thread in control state $l_1$ so that its register valuation is $\vec{v}_{reg}$
and address $a$ being read holds value $v$. Firing the transition only updates the
state of the thread instance: its program counter is set to label $l_2$, and the value
of register $r$ is set to $v$.
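
The construction for load instructions can be sketched in code. The Python fragment below is a minimal illustration under our own assumptions: the tuple encoding of places and transitions, the function name `load_transitions`, and the example instruction are hypothetical and only mirror the weights described above.

```python
from itertools import product

def load_transitions(dom, regs, r, l1, l2, f_a):
    """Weights for one labelled load  l1: r <- mem[f_a(regs)]; goto l2;

    Places are encoded as ("ctl", label, register_valuation) and
    ("mem", address, value); this encoding is an assumption of the sketch.
    f_a maps a register valuation (dict) to the address being read.
    """
    W = {}
    for v in dom:
        for vreg in product(dom, repeat=len(regs)):
            t = ("load", l1, v, vreg)
            # Updated valuation: register r now holds the loaded value v.
            vreg2 = tuple(v if name == r else old
                          for name, old in zip(regs, vreg))
            a = f_a(dict(zip(regs, vreg)))
            W[(("ctl", l1, vreg), t)] = 1    # take one instance at l1
            W[(t, ("ctl", l2, vreg2))] = 1   # put it at l2 with r := v
            W[(("mem", a, v), t)] = 1        # require val(a) = v ...
            W[(t, ("mem", a, v))] = 1        # ... and leave memory unchanged
    return W

# Hypothetical example: domain {0, 1}, one register "r", load from address 0.
W = load_transitions(dom=(0, 1), regs=("r",), r="r",
                     l1="l1", l2="l2", f_a=lambda valuation: 0)
num_transitions = len({k[1] for k in W if k[0][0] in ("ctl", "mem")})
```

For a domain of size two and one register the sketch creates four transitions, one per pair of loaded value and register valuation, each with four arcs.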

We define the initial marking by $M_0(s_{a,0}) = 1$ for all $a \in DOM$, and $M_0(s) = 0$
for all other places $s \in S$. Reaching a goal configuration val(suc) = true in the
parameterized program now corresponds to covering the following marking $M_{suc}$
in the resulting Petri net: $M_{suc}(s_{suc,true}) = 1$ and $M_{suc}(s) = 0$ for all other places
$s \in S$. Combining this reduction with Lemma 9 gives Theorem 10.

Theorem 10. Robustness for parameterized programs over finite domains is 
decidable. 



Lower bound The upper bound on robustness for parameterized programs 
depends on the data domain. Interestingly, an EXPSPACE lower bound already
holds for domains with two values. The proof reduces the coverability problem 
in Petri nets to robustness of parameterized programs. EXPSPACE-hardness of 
coverability is a classic result by Lipton [23]. That we can restrict ourselves to 
domains with two elements means the control flow in a parameterized program 
is expressive enough to encode the Petri net behaviour. 

The idea behind the construction is to take thread instances as tokens. Each 
thread has a label for each place in the Petri net, plus an additional label that 
indicates the token is currently not in use. The Petri net transitions are mimicked 
by a controller thread. It serialises the reading and writing of tokens, checks the 
coverability query, and then enters a non-robust situation. To read and write 
tokens, the controller communicates with the token threads via the memory. 
The construction requires locked instructions, which are immediate to add to 
our programming model. 

Theorem 11. Robustness for parameterized programs is EXPSPACE-hard, for
any domain with at least two elements.



